LxMLS - Lab Guide 



July 15, 2014 



Day 0 

Basic Tutorials 



In this class we will introduce several fundamental concepts needed further ahead. We start with an introduc- 
tion to Python, the programming language we will use in the lab sessions, and to Matplotlib and Numpy, two 
modules for plotting and scientific computing in Python, respectively. Afterwards, we present several notions 
on probability theory and linear algebra. Finally, we focus on numerical optimization. 

The goal of this class is to give you the basic knowledge for you to understand the following lectures. We 
will not enter in too much detail in any of the topics. 

0.1 Python 

0.1.1 Python Basics 
Running Python code 

We will start by creating and running a dummy program in Python which simply prints the "Hello World!" 
message to the standard output (this is usually the first program you code when learning a new programming 
language). 

There are two main ways in which you can run code in Python: 
From a file - Create a file named your file . py and write your program in it, using your favorite text editor: 
print 'Hello World! ' 



After saving and closing the file, you can run your code by calling: 

python yourfile.py 

in the command line. This will run the program and display the message "Hello World!". After, the 
control will return to the command line. 

In the interactive command line - Start the interactive command line in Python using the command python. 
After this, you can run Python code by simply writing it and pressing enter In our lab sessions, we will 
use Python in interactive mode several times. The standard Python interface is not very friendly, though. 
IPython, which stands for interactive Python, is an improved Python shell. It saves your command history 
between sessions, has basic auto-complete, and has internal support for interacting with graphs through 
matplotlib. IPython is also designed to facilitate running parallel code on clusters of machines, but we 
will not make use of that functionality. 

To run IPython, simply tjrpe ipython on your command Irn^ For interactive numeric use, the — pylab 
flag imports numpy and matplotlib (the two libraries we will extensively use in the lab sessions) for you 
and sets up interactive graphs: 

IPython — pylab 
^Note that in some systems, e.g. Linux, you may need to run the command lower-cased. 
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You can then run Python commands in the IPython command line 



In[]: print "Hello, World!" 
Out [ ] : Hello, World! 



but you can also run Python code written into a file. 



In[] : run . /yourf ile . py 
Out [ ] : Hello, World! 



Keep in mind that you can easily switch between these two modes. You can quickly test commands in the 
command line directly and e.g. inspect variables. Larger sections of code can be stored and run from files. 

Help and Documentation 

There are several ways to get help on IPython: 

• Adding a question mark to the end of a function or variable and pressing Enter brings up associated doc- 
umentation. Unfortunately, not all packages are well documented. Numpy and matplotlib are pleasant 
exceptions; 

• help ( ' print ' ) gets the online documentation for the print keyword; 

• he Ip ( ) , enters the help system. 

• When at the help system, type q to exit. 



For more information on IPython (Perez and Granger 2007 1, check the website: http : / /ipython . scipy . 
|org/moin/| 

Exiting 

Exit IPython by typing exit { ) or quit ( ) (or t5^ing CTRL-D). 

0.1.2 Python by Example 
Data Structures 

In Python, you can create Hsts of items with the following syntax: 

countries = [' Portugal ',' Spain ',' United Kingdom'] 

A string should be surrounded with apostrophes ('). You can access a list with the following: 

• len ( L ) , which returns the number of items in L; 

• L [ i ] , which returns the item at index ; (the first item has index 0); 

• L [ i : j ] , which returns a new list, containing all the items between indexes i and j — 1, inclusive. 
Exercise 0.1 Use Lli-.j] to return the countries in the Iberian Peninsula. 
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Loops and Indentation 

A loop allows a section of code to be repeated a certain number of times, until a stop condition is reached. For 
instance, when the list you are iterating has reached its end or when a variable has reached a certain value (in 
this case, you should not forget to update the value of that variable inside the code of the loop). In Python you 
have while and for loop statements. The following two example programs output exactly the same using 
both statements: the even numbers from 2 to 8. 



i = 2 

while i < 10: 
print i 

i += 2 



for i in range (2 , 1 0 , 2 ) : 
print i 



You can copy and run this from the IPython command line. Alternatively you can write this into your 
yourf ile .py file an run it as well. Notice something? It is possible that the code did not act as expected 
or maybe an error message popped up. This brings us to an important aspect of Python: indentation. In- 
dentation is the number of blank spaces at the leftmost of each command. This is how Python differentiates 
between blocks of commands inside and outside a statement, e.g. while, for or other. All commands within 
a statement have the same number of blank spaces at their leftmost. For instance, consider the following code: 

a=l 

while a <= 3 : 
print a 

a += 1 



and its output: 

1 
2 

3 



Exercise 0.2 Can you then predict the output of the following code?: 

a=l 

while a <= 3: 
print a 

a += 1 



Bear in mind that indentation is often the main source of errors when starting to work with Python. Try to 
get used to it as quickly as possible. It is also recommendable that you use a text editor that can display all 
characters e.g. blank space, tabs, since these characters can be visually similar but are considered different by 
Python. One of the most common mistakes by newcomers to Python is to have their files indented with spaces 
on some lines and with tabs on other lines. Visually it might appear that all lines have proper indentation, but 
you will get an IndentationError message if you try it. 

Control Flow 

The if statement allows to control the flow of your program. The next program outputs a greeting that 
depends on the time of the day. 
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hour = 15 






if hour < 


12 : 




print 


'Good 


morning ! ' 


elif hour 


>= 12 


and hour < 20 : 


print 


'Good 


afternoon ! ' 


else : 






print 


'Good 


evening ! ' 



Functions 



A function is a block of code that can be reused to perform a similar action. The following is a function in 
Python. 



def greet (hour 


) : 




if hour < 


12 : 




print 


'Good 


morning ! ' 


elif hour 


>= 12 


and hour < 20 : 


print 


'Good 


afternoon ! ' 


else : 






print 


'Good 


evening ! ' 



You can write this command into IPython interactive command line directly or write them into a file and 
run the file in IPython. Once you do this the function will be available for you to use. Call the function 
greet with different hours of the day (for example, type greet (16)) and see that the program will greet you 
accordingly. 



Exercise 0.3 Note that the previous code allows the hour to be less than 0 or more than 24. Change the code in order to 
indicate that the hour given as input is invalid. Your output should be something like: 



greet (50) 














Invalid hour: 


it 


should 


be 


between 


0 


and 24. 


greet (-5) 














Invalid hour: 


it 


should 


be 


between 


0 


24. 



Profiling 

If you are interested in checking the performance of your program, you can use the command %prun in 
IPython (this is an IPython-only feature). For example: 

def myf unction (x) : 

%prun myf unction (22 ) 



The output of the %prun command will show the following information for each function that was called 
during the execution of your code: 

• ncall s: The number of times this function was called. If this function was used recursively, the output 
will be two numbers; the first one counts the total function calls with recursions included, the second 
one excludes recursive calls. 

• tottime: Total time spent in this function, excluding the time spent in other functions called from within 
this function. 

• percall: Same as tottime, but divided by the number of calls. 
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• cumtime: Same as tottime, but including the time spent in other functions called from within this 
function. 



• percall: Same as cumtime, but divided by the number of calls. 

• filename : lineno ( function) : Tells you where this function was defined. 

Debugging in Python 

During the lab sessions we will use the previously described IPython iterative command line which allows 
you to execute a script, command by command. This should limit the need for debugging tools. However, 
there will be situations in which we will use and extend modules that involve more elaborated code and 
statements, like classes and nested functions. Although desirable, it should not be necessary for you to fully 
understand the whole code to carry out the exercises. It will suffice to understand the algorithm as explained 
in the theoretical part of the class and the local context of the part of the code where we will be working. 

The simplest way to do this is to run the code and stop the execution at a given point (called break-point) 
to get a quick glimpse of the variable structures and to inspect the execution flow of your program. For that, 
you can use the ipdb module. 

In the following example, we use this module to inspect the greet function: 



def greet (hour) 






if hour < 1 


2 : 




print ' 


Good 


morning ! ' 


elif hour > 


= 12 


and hour < 20 : 


print ' 


Good 


afternoon ! ' 


else : 






import 


ipdb, 


ipdb . set_trace ( ) 


print ' 


Good 


evening ! ' 



Load the new definition of the function into IPython by writting this code in a file and running it. Now, 
if you try greet (50) the code execution should stop at the place where you located the break-point (that 
is, in the print ' Good evening! ' statement). You can now run new commands or inspect variables. 
For this purpose there are a number of commands you can use. The complete list can be found at http :] 
[/^/docs . python . org/ library/pdb . html but we provide here a short table with the most useful: 



(h)elp 
(p)rint 

(p)retty(p)rint 


Starts the help menu 
Prints a variable 

Prints a variable, with line break (useful for lists) 


(n)ext line 
(s)tep 

c(ont(rnue)) 
(r)eturn 
b(reak) n 


Jumps to next line 

Jumps inside of the function we stopped at 
Continues execution until finding breakpoint or finishing 
Continues execution until current function returns 
Sets a breakpoint in in line n 


l(ist) [n], [m] 

w(here) 

u(p) 

d(down) 


Prints 11 lines around current line. Optionally starting in line n or between lines n, m 
Shows which function called the function we are in, and upwards (staclj^ 
Goes one level up the stack (frame of the function that called the function we are on) 
Goes one level down the stack 


blank 
expression 


Repeat the last command 

Executes the python expression as if it was in current frame 



Table 1: Basic pdb/ipdb commands, parentheses indicates abbreviation 
So getting back to our example, we can type n(ext) once to execute the line we stopped at 



ipdb> n 






> . / Ixmis- 


-toolkit/yourfile.py(J 


i ) greet ( ) 


7 


import 


ipdb; ipdb . set_trace ( ) 


> 8 


print 


Good evening ! ' 



^Note that since we are inside the IPython command line, the IPython functions will also appear at the top. 
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Now we can inspect the variable hour using the p(retty)p(rint) option 



ipdb> pp hour 
50 



From here we could keep advancing with the n(ext) option or set a b(reak) point and type c(ontinue) to 
jump to a new position. We could also execute any python expression which is valid in the current frame 
(the function we stopped at). This is particularly useful to find out why code crashes, as we can try different 
alternatives without the need to restart the code again. 

0.1.3 Exceptions 

Occasionaly, a S5mtactically correct code statement may produce an error when an attempt is made to execute 
it. These kind of errors are called exceptions in Python. For example, try executing the following: 

10/0 



A ZeroDivisionError exception was raised, and no output was returned. Exceptions can also be forced 
to occur by the programmer, with customized error messages (for a complete list of built-in exceptions, see 
http : //docs . python . org/ 2 /library/ exceptions . html) . 

raise ValueError (" Invalid input value.") 



Exercise 0.4 Rewrite the code in Exercise 0.3 in order to raise a ValueError exception when the hour is less than 0 or 
more than 24. 



Handling of exceptions is made with the try statement: 



while True : 




try: 




X = int (raw_input ( "Please enter a number: ") 




break 




except ValueError : 




print "Oops! That was no valid number. Try again. 


II 



It works by first executing the try clause. If no exception occurs, the except clause is skipped; if an exception 
does occur, and if its type matches the the exception named in the except keyword, the except clause is executed; 
otherwise, the exception is raised and execution is oborted (if it is not caught by outer try statements). 

Extending basic Functionalities with Modules 

In Python you can load new functionalities into the language by using the import, from and as keywords. 
For example we can load the numpy module as 

import numpy as np 



then we can run the following on the IPython command line 

np . var ? 

np . random. normal ? 



The import will make the numpy tools available through the alias np. This shorter alias prevents the code 
from getting too long if we load lots of modules. The first command will display the help for the method 
numpy . var using the previously commented symbol ?. Note that in order to display the help you need the 
full name of the function including the module name or alias. Modules have also submodules that can be 
accessed the same way, as shown in the second example. 
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interactively or save them to a file (several output formats are supported). 

import numpy as np 

import matplotlib .pYplot ~~ pit 

X = np. linspace (-4, 4, 1000) 

plt.plot(X, X**2 *np . cos (X**2 ) ) 
pit . save fig ( "simple . pdf " ) 



Exercise 0.5 Try running the following on IPython, which will introduce you to some of the basic numeric and plotting 
operations. 

# This will import the numpy library 

# and give it the np abbreviation 
import numpy as np 

# This will import the plotting library 
import matplotlib .pyplot as pit 

# Linspace will return 1000 points, 

# evenly spaced between -4 and +4 
X = np. linspace (-4, 4, 1000) 

# Y[i] = X[i]**2 
Y = X**2 

# Plot using a red line ('r') 
pit. plot (X, Y, 'r') 

# arange returns integers ranging from -4 to +4 

# (the upper argument is excluded!) 
Ints = np . arange (-4 , 5) 

# We plot these on top of the previous plot 

# using blue circles (o means a little circle) 
pit .plot (Ints, Ints**2, 'bo') 

# You may notice that the plot is tight around the line 

# Set the display limits to see better 
pit . xlim (-4 . 5, 4.5) 

plt.ylim (-1,1 7) 
pi t . show ( ) 



0.1.5 Numpy - Scientific Computing with Python 



Multidimensional Arrays 

The main object of numpy is the multidimensional array. A multidimensional array is a table with all elements 
of the same type and can have several dimensions. Numpy provides various functions to access and manipu- 
late multidimensional arrays. In one dimensional arrays, you can index, slice, and iterate as you can with lists. 
In a two dimensional array M, you can use perform these operations along several dimensions. 

■ http : //matplotlib . org/ 
^ http : / /www . numpy . org/ 




is a library for scientific computing with Python. 
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• M[i,j], to access the item in the i row and j ' column; 

• M[i:j,:], to get the all the rows between the i^^ and / — 1^''; 

• M[:,i], to get the f"' column of M. 

Again, as it happened with the lists, the first item of every column and every row has index 0. 

import numpy as np 
A = np . array ( [ 

[1,2,3] , 

[2,3,4] , 

[4,5,6]]) 

A[0, : ] # This is [1, 2, 3] 

A[0] # This is [1,2,3] as well 

A[ : , 0] # this is [1, 2, 4] 

A[1:,0] # This is [ [2], [4] ]. Why? 

# Because it is the same as A[l:n,0] where n is the size of the array. 



IVIathematical Operations 



There are many helpful functions in numpy. For basic mathematical operations, we have np . log, np . exp, 
np . cos,. . .with the expected meaning. These operate both on single arguments and on arrays (where they 
will behave element wise). 



import matplotlib . pyplot 


■-. pit 


import numpy as np 




X = np . linspace ( 0 , 4 * np 


pi, 1000) 


C = np . cos (X) 




S = np . sin (X) 




plt.plot(X, C) 




plt.plot(X, S) 





Other functions take a whole array and compute a single value from it. For example, np . sum, np . mean,. . . These 
are available as both free functions and as methods on arrays. 

import numpy as np 
A = np . arange ( 100 ) 

# These two lines do exactly the same thing 
print np.mean(A) 
print A. mean ( ) 

C = np . cos (A) 
print C .ptp ( ) 



Exercise 0.6 Run the above example and lookup the ptp function/method (use the ? functionality in IPython). 
Exercise 0.7 Consider the following approximation to compute an integral 



L 



0 ^ ,^0 1000 ■ 



Use numpy to implement this for f{x) = x^. You should not need to use any loops. Note that integer division in 
Python 2.x returns the floor division (use floats - e.g. 5.0/2.0 - to obtain rationals). The exact value is 1/3. How close 
is the approximation ? 
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0.2 Essential Linear Algebra 

Linear Algebra provides a compact way of representing and operating on sets of linear equations. 



Axi —5x2 = — 13 
— 2xi +3x2 = 9 

This is a system of linear equations in 2 variables. In matrix notation we can write the system more com- 
pactly as 

Ax = b 



with 



A = 



"4 -5 ■ 




■ -13 " 


-2 3 
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0.2.1 Notation 

We use the following notation: 

• By A E W" ^ ", we denote a matrix with m rows and n columns, where the entries of A are real numbers. 

• By X E ]R", we denote a vector with n entries. A vector can also be thought of as a matrix with n rows 
and 1 column, known as a column vector. A row vector — a matrix with 1 row and n columns is denoted 
as x^, the transpose of x. 

• The fth element of a vector x is denoted x. 



X2 



Exercise 0.8 In the rest of the school we will represent both matrices and vectors as numpy arrays. You can create arrays 
in different ways, one possible way is to create an array of zeros. 



import 


numpy as 


np 


m = 3 






n = 2 






a = np. 


zeros ( [m. 


n]) 


print a 






[[ 0. 


0.] 




[ 0. 


0.] 




[ 0. 


0.]] 





You can check the shape and the data type of your array using the following commands: 

print a. shape 
(3, 2) 

print a . dtype . name 
float64 



This shows you that "a" is an 3*2 array of type float64. By default, arrays contain 64 bi^^oating point numbers. You 
can specify the particular array type by using the keyword dtype. 



a = np . zeros ( [m, n] , dtype=int) 

print a . dtype 

int64 



^On your computer, particularly if you have an older computer, int might denote 32 bits integers 
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You can also create arrays from lists of numbers: 



a = np.array([[2,3], [3,4]]) 
print a 

[[2 3] 
[3 4]] 



There are many more ways to create arrays in numpy and we will get to see them as we progress in the classes. 

0.2.2 Some Matrix Operations and Properties 

• Product of two matrices A e R'"^" and B e ]R"''P is the matrix C = AB e ]R"'^P, where 



k=l 

Exercise 0.9 You can multiply two matrices by looping over both indexes and multiplying the individual entries. 



a = np. array ( [ [2, 3] , [3, 4] ] ) 
h = np.array([[l,l], [1,1]]) 
a_diml, a_dim2 = a. shape 
b_diml, b_dim2 = b . shape 
c = np . zeros ( [a_diml , b_dim2 ] ) 
for i in xrange (a_diml) : 
for j in xrange (b_dim2 ) : 

for k in xrange (a_dim2 ) : 
c[i,j] += a [i, k] *b [k, j ] 

print c 



This is, however, cumbersome and inefficient. Numpy supports matrix multiplication with the dot function: 



d = np . dot (a, b) 
print d 



Important note: with numpy, you must use dot to get matrix multiplication, the expression a * b denotes 
element-wise multiplication. 

• Matrix multiplication is associative: {AB)C = A{BC). 

• Matrix multiplication is distributive: A{B + C) = AB + AC. 

• Matrix multiplication is (generally) not commutative : AB ^ BA. 

• Given two vectors x,y e IR" the product x^y, called inner product or dot product, is given by 



x^y e IR = [ X2 



yi 
yi 

yn 



E ^iyi- 



a = np . array ( [ 1 , 2 ] ) 
b = np . array ([ 1 , 1 ] ) 
np . dot ( a, b) 
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• Given vectors x £ W and y G R", the outer product xy^ £ jj^mxn ^ matrix whose entries are given by 



Xi 
X2 



[ yi y2 ■ ■ ■ Vn ] 



xiyi xii/2 ■■■ ^iy« 
X2yi X2y2 ... X2y„ 

Xjiiyi X,j;l/2 . . . ^myn 



np . outer (a, b) 
array ( [ [1, 1] , 
[2, 2] ] ) 



• The identity matrix, denoted I E W ^", is a square matrix with ones on the diagonal and zeros every- 
where else. That is. 

It has the property that for all A e R"^", A7 = A = I A. 

I = np . eye ( 2 ) 

X = np . array ( [ 2 . 3 , 3.4]) 

print I 

print np.dot(I,x) 

[[ 1., 0.], 
[ 0., 1.]] 
[2.3, 3.4] 



• A diagonal matrix is a matrix where all non-diagonal elements are 0. 

• The transpose of a matrix results from '"flipping" the rows and columns. Given a matrix A E ]R™^", the 
transpose A''^ E ]R"^'" is the n x m matrix whose entries are given by (A^),y = Ajj. 

Also, = (AB)^ = B^A^; {A + = A'^ + B'^ 

In numpy you can access the transpose of a matrix as the T attribute: 



A = np . array ( [ 


[1, 2], 


[3, 4] ]) 


print A . T 







• A square matrix A E R"^" is symmetric if A = A^. 

n 

• The trace of a square matrix A E R"'^" is the sum of the diagonal elements, tr(A) = X] 

i=l 

0.2.3 Norms 

The norm of a vector is informally the measure of the "length" of the vector. The commonly used Euclidean 
or £2 norm is given by 



X\\2 = 




• More generally, the £p norm of a vector x E R", where p > 1 is defined as 



n 

Note: £-1 norm : ||x||i = Yj ^00 norm : ||x||oo = max;|x/ 

1=1 
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0.2.4 Linear Independence, Rank, and Orthogonal Matrices 

A set of vectors {xi,X2, ■ ■ ■ ,Xn} C R*" is said to be (linearly) independent if no vector can be represented as a 

linear combination of the remaining vectors. Conversely, if one vector belonging to the set can be represented 
as a linear combination of the remaining vectors, then the vectors are said to be linearly dependent. That is, if 

for some e {1, . . . , n} and some scalar values oci,..., a,_i, af+i, ...,oin £ IR- 

• The rank of a matrix is the number of linearly independent columns, which is always equal to the nirmber 

of linearly independent rows. 

• For A e Ji^x"^ rank(A) < min(m, n). If rank(A) =min(m, n), then A is said to be full rank. 

• For A e ]?'"^'',rank(A)=rank(A^). 

• For A e R*"^", B e R^^'P, rank(AB) < min(rank(A),rank(B)). 

• For A, B e K'"^", rank(A + B) < rank(A) + rank(B). 

• Two vectors x,y G M" are orthogonal if x^y = 0. A square matrix U e R"^" is orthogonal if all its 
colirams are orthogonal to each other and are normalized (||3c||2 = 1), It follows that 

0.3 Probability Theory 

Probability is the mathematical language for quantifying uncertainty. The sample space X is the set of possible 
outcomes of an experiment. Events are subsets of X. 

Example 0.1 (discrete space) Let H denote "heads" and T denote "tails." If we toss a coin twice, then X — {HH, HT, TH,TT}. 
The event that the first toss is heads is A — {HH, HT}. 

Sample space can also be continuous (eg., X — K). The imion of events A and B is defined as A IJ B = {a; G 

n 

X I o; e A V o; e B}. If Ax, . . ., A« is a sequence of sets then [j Aj — {w & X \ a? £ Ai for at least one i}. We 

1=1 

say that Ai, . . ., A„ are disjoint or mutually exclusive if A,- DAj — 0 whenever i ^ 

We want to assign a real niraiber P(A) to every event A, called the probability of A. We also call P a 
probability distribution or probability measure. 



Definition 0.1 A function P that assigns a real number P{A) to each event A is a probability distribu- 
tion or a probability measure if it satisfies the three following axioms: 

Axiom 1: -P(A) > 0 for every A 
Axiom 2: P(X) = 1 

Axiom 3: If A\, . .., An are disjoint then 



One can derive many properties of P from these axioms: 



P(0) = 0 

ACS P(A) < P(B) 

0< P(A) <1 

P(A') = 1-P{A) (A' is the complement of A) 

P(AUB) = P{A)+P{B)-P{A^B) 

AnB = (^ ^ P{AUB)^P{A)+P{B). 
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An important case is when events are independent, this is also a usual approximation which lends several 
practical advantages for the computation of the joint probability. 



Definition 0.2 Two events A and B are independent if 

P{AB) = P{A)P{B) (1) 
often denoted as A JL B. A set of events {A, : f e 7} is independent if 



for every finite subset } of I. 



For events A and B, where P{B) > 0, the conditional probability of A given that B has occurred is defined 

as: 

Events A and B are independent if and only if P{A\B) = P{A). This follows from the definitions of 
independence and conditional probability. 

A preliminary result that forms the basis for the famous Bayes' theorem is the law of total probability which 
states that if Ai, . . . , Aj^ is a partition of X, then for any event B, 

k 

P{B) = J2P{B\A,)P{At). (3) 
Using Equations |2] and |3| one can derive the Bayes' theorem. 



Theorem 0.1 (Bayes' Theorem) Let A\, ■■■A\^ he a partition of X sudi that -P(A;) > 0/or each i. If 
P{B) > 0 then, for each i = l,...,k, 



Remark 0.1 P{Ai) is called the prior probability o/A; and P(A, |B) is the posterior probability o/A/. 

Remark 0.2 In Bayesian Statistical Inference, the Bayes' theorem is used to compute the estimates of distribution pa- 
rameters from data. Here, prior is the initial belief about the parameters, likelihood is the distribution function of the 
parameter (usually trained from data) and posterior is the updated belief about the parameters. 



0.3.1 Probability distribution functions 

A random variable is a mapping X : X ^ ]R that assigns a real number X(a;) to each outcome oj. Given a ran- 
dom variable X, an important function called the cumulative distributive function (or distribution function) 
is defined as: 

Definition 0.3 The cumulative distribution function CDF Fx : K ^ [0,1] of a random variable X is defined by 
Fxix)=P{X<x). 

The CDF is important because it captures the complete information about the random variable. The CDF 
is right-continuous, non-decreasing and is normalized (limx^-t»F(x) — 0 and limx^ooF(x) = 1). 

Example 0.2 (discrete CDF) Flip a fair coin twice and let X be the random variable indicating the number of heads. 
Then P{X = 0) = P(X = 2) = 1/4 and P{X = 1) = 1/2. The distribution function is 

X <0 

0 < X < 1 

1 < X < 2 
X > 2. 



Fx{x) 
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Definition 0.4 X is discrete if it takes countable many values {x\,X2, ■ . .}. We define the probability function or 
probability mass function /or X hy 

/x(x)=P(X = x). 

Definition 0.5 A random variable X is continuous if there exists a function fx such that fx > Ofor all x, 

oo 

/ fxix)dx — 1 and for every a <b 

— 00 

b 

P{a<X<b)^ j fx{x)dx. (5) 
a 

The function fx is called the probability density function (PDF). We have that 

X 

Mx) = / fx{t)dt 

—oo 

a)h1 fx{x) — f'x{^) poijits X at ii'liidi fx is diffcrcjitinhh'. 
A discussion of a few importajnt distributions and related properties: 

0.3.2 Bernoulli 

The Bernoulli distribution is a discrete probability distribution that takes value 1 with the success probability 
p and 0 with the failure probability q = 1 — p. A single Bernoulli trial is parametrized with the success 
probability p, and the input k e {0, 1} (l=success, 0=failure), and can be expressed as 

0.3.3 Binomial 

The probability distribution for the number of successes in n Bernoulli trials is called a Binomial distribution, 
which is also a discrete distribution. The Binomial distribution can be expressed as exactly successes is 

/0>;p) = ( ^ ) pic,-' = ( ^ ) p^(l - pr-i 

where n is the number of Bernoulli trials with probability p of success on each trial. 

0.3.4 Categorical 

The Categorical distribution (often conflated with the Multinomial distribution, in fields like Natural Lan- 
guage Processing) is another generalization of the Bernoulli distribution, allowing the definition of a set of 
possible outcomes, rather than simply the events "success" and "failure" defined in the Bernoulli distribution. 
Considering a set of outcomes indexed from 1 to n, the distribution takes the form of 

f{xi;pi,...,p„) = Pi. 

where parameters pi,...,p„ is the set with the occurrence probability of each outcome. Note that we must 
ensure that Yj"=\ Pi = 1/ so we can set p„ = 1 — Tlj^i Pi- 

0.3.5 Multinomial 

The Multinomial distribution is a generalization of the Binomial distribution and the Categorical distribution, 

since it considers multiple outcomes, as the Categorial distribution, and multiple trials, as in the Binomial 
distribution. Considering a set of outcomes indexed from 1 to n, the vector xi, ...,x„, where x, indicates the 
number of times the event with index i occurs, follows the Multinomial distribution 

f{xi,...,Xn;Vi,...,Vn) = ^ , Pi-Pn"- 
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Where parameters pi, p„ represent the occurrence probability of the respective outcome. 
0.3.6 Gaussian Distribution 

A very important theorem in probability theory is the Central Limit Theorem. The Central Limit Theorem 
states that, under very general conditions, if we sum a very large number of mutually independent random 
variables, then the distribution of the sum can be closely approximated by a certain specific continuous density 
called the normal (or Gaussian) density. The normal density fimction with parameters }i and a is defined as 
follows: 

/x(x) = -^^e-i^-l'f'^^^ - CO < X < CO. 
^/2no■ 




Figure 1: Normal density for two sets of parameter values. 



Figure[T]compares a plot of normal density for the cases }i = Q and u = 1, and }i = Q and (7 = 2. 
0.3.7 Maximum Likelihood Estimation 

Until now we assumed that, for every distribution, the parameters 9 are known and are used when we calculate 
There are some cases where the values of the parameters are easy to infer, such as the probability p of 
getting a head using a fair corn, used on a Bernoulli or Binomial distribution. However, in many problems, 
these values are complex to define and it is more viable to estimate the parameters using the data x. For 
instance, in the example above with the coin toss, if the coin is somehow tampered to have a biased behavior, 
rather than examining the dynamics or the structure of the coin to infer a parameter for p, a person could 
simply throw the coin n times, count the number of heads h and set p = | . By doing so, the person is using 
the data x to estimate 9. 

With this in mind, we will now generalize this process by defining the probability p{9\x) as the probability 
of the parameter 6, given the data x. This probability is called likelihood L[Q\x) and measures how well the 
parameter 9 models the data x. The likelihood can be defined in terms of the distribution / as 

J^{9\xi x„)=flf{xi\9) 

1=1 

where xj, x„ are independently and identically distributed (i.i.d.) samples. 

To understand this concept better, we go back to the tampered corn example again. Suppose that we throw 
the coin 5 times and get the sequence [1,1,1,1,1] (l=head, 0=tail). Using the Bernoulli distribution (see Sec- 
tion p 



0.3.2 1 / to model this problem, we get the following likelihood values: 

L{0,x) =/(l,0)5 =0^ = 0 

£(0.2,x) = /(1, 0.2)5 ^ 0.2^ = 0.00032 
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= /(1, 0.4)5 


= 0.45 


= 0.01024 


C{0.6,x) 


= /(1, 0.6)5 


= 0.6^ 


= 0.07776 


L{0.8,x) 


= /(1,0.8)5 


= 0.35 


= 0.32768 



• £(l,x) =/(l,l)5 = l5 = l 

If we get the sequence [1,0,1/1/0] instead, the likelihood values woiild be: 



£(0,x) = 


= /(l,0)3/(0,0)2 = 0^X12 = 


0 




£(0.2, a:) 


= /(1, 0.2)3/(0, 0.2)2 


= 0.23 


xO.82 


= 0.00512 


£(0.4,;c) 


= /(1, 0.4)3/(0, 0.4)2 


= 0.43 


xO.62 


= 0.02304 


£(0.6,;c) 


= /(1, 0.6)3/(0, 0.6)2 


= 0.63 


xO.42 


= 0.03456 


£(0.8,x) 


= /(l,0.8)3/(0,0.8)2 


= 0.83 


xO.22 


= 0.02048 


L{l,x) = 


= /(l,l)5 = l3 X o2 = 


0 







We can see that the likelihood is the highest when the distribution / with parameter p is the best fit for the 
observed samples. Thus, the best estimate for p according to x would be the value for which L{p,x) is the 
highest. 

The value of the parameter 6 with the highest likelihood is called maximum likelihood estimate (MLE) 

and is defined as 

^mle = argmax0Z{e\x) 

Finding this for our example is relatively easy, since we can simply derivate the UkeUhood function to find 
the absolute maximum. For the sequence [1,0,1,1,0], the UkeHhood woiild be given as 

L{p,x)^f{l,pff{Q,pf^p^{l-pf 

And the MLE estimate would be given by: 

5p 

which resolves to 

PmU = 0.6 

Exercise 0.10 Over the next couple of exercises we will make use of the Galton dataset, a dataset of heights of fathers 
and sons from the 1877 paper that first discussed the "regression to the mean" phenomenon. This dataset has 928 pairs 
of numbers. 

• Use the load 0 function in the gal t on . pyfile to load the dataset. The file is located under the Ixml s/readers 
folder. Type the following in your Python interpreter: 

import galton as galton 
GaltonData = galton . load ( ) 

• Yihat are the mean height and standard deviation of all the people in the sample? What is the mean height of the 
fathers and of the sons? 

• Plot a histogram of all the heights (you might want to use the pit. hist function and the ravel method on 
arrays). 

• Plot the height of the father versus the height of the son. 

• You should notice that there are several points that are exactly the same (e.g., there are 21 pairs with the values 68.5 
and 70.2). Use the ? command in ipython to read the documentation for the numpy . random, randn function 
and add random jitter (i.e., move the point a little bit) to the points before displaying them. Does your impression 
of the data change? 
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0.3.8 Conjugate Priors 

Definition 0.6 let 3" = {fx{x\s),s e X} be a class of likelihood functions; let 7 be a class of probability (density or 
mass) functions; if for any x, any ps{s) G f, and any fx{x\s) £ 5", the resulting a posteriori probability function 
ps{s\x) = fx{x\s)ps{s) is still in T, then 7 is called a conjugate family, or a family o/ conjugate priors, /or 5". 



0.4 Numerical optimization 

Most problems in machine learning require minimization/ maximization of functions (likelihoods, risk, energy, 
entropy, etc.,)- Let x* be the value of x which minimizes the value of some function f{x). Mathematically, this 
is written as 



argmin/(x) 

X 



In a few special cases, we can solve this minimization problem analytically in closed form (solving for 
optimal X* in \/xf{x*) = 0), but in most cases it is too cumbersome (or impossible) to solve these equations 
analytically, and they must be tackled numerically. In this section we will cover some basic notions of numer- 
ical optimization. The goal is to provide the intuitions behind the methods that will be used in the rest of the 



school. There are plenty of good textbooks in the subject that you can consult for more information ( Nocedal 



and Wright} |1999||Bertsekas et aH[T995||Boyd and Vandenberghe}|2004^ . 

The most common way to solve the problems when no closed form solution is available is to resort to an 
iterative algorithm. In this Section, we will see some of these iterative optimization techniques. These iterative 
algorithms construct a sequence of points x^^\ x^^\ ... £ domain (/) such that hopefully x* = x* after a number 
of iterations. Such a sequence is called the minimizing sequence for the problem. 



0.4.1 Convex Functions 

One important property of a function f{x) is whether it is a convex function (in the shape of a bowl) or a 
non-convex function. Figures |2] and |3] show an example of a convex and a non-convex function. Convex 
functions are particularly useful since you can guarantee that the minimizing sequence converges to the true 
global minimum of the function, while in non-convex functions you can only guarantee that it will reach a 
local minimum. 

Intuitively, imagine dropping a ball on either side of Figure |2| the ball will role to the bottom of the bowl 
independently from where it is dropped. This is the main benefit of a convex function. On the other hand, if 
you drop a ball from the left side of Figure |3] it will reach a different position than if you drop a ball from its 
right side. Moreover, dropping it from the left side will lead you to a much better {i.e., lower) place than if you 
drop the ball from the right side. This is the main problem with non-convex functions: there are no guarantees 
about the quality of the local minimum you find. 

More formally, some concepts to understand about convex functions are: 
A line segment between points Xi and X2: contains all points such that 

X = 9xi + (1 — 0)X2 

where 0 <0 <1. 

A convex set contains the line segment between any two points in the set 

Xi,X2&C, 0<e<l => 0xi + (1 - 0)x2 e C 



A fimction / : R" — )• K is a convex function if the domain of / is a convex set and 

f{ex + {i-e)y)<ef{x) + {i-e)f{y) 

for all X, 1/ e domain of /, 0 < 0 < 1 



0.4.2 Derivative and Gradient 

The derivative of a function is a measure of how the function varies with its input variables. Given an interval 
[a, b] one can compute how the function varies within that interval by calculating the average slope of the 
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Figure 2: Illustration of a convex function. The line segment between any two points on the graph lies entirely 
above the curve. 




Figure 3: Illustration of a non-convex function. Note the line segment intersecting the curve. 



fimction in that interval. 



m-fia) 



(6) 



The derivative can be seen as the limit as the interval goes to zero, and it gives us the slope of the function at 
that point. 

'l=,^f±±R^ (7) 
dx h=o h 

Table|2]shows derivatives of some fimctions that we will be using during the school. 



Function f{x) 


Derivative |^ 




Ix 


x" 


nx"-^ 


log(x) 


1 

X 


exp(x) 


exp(x) 


1 

X 


1 

x2 



Table 2: Some derivative examples 
An important rule of derivation is the chain rule. Consider h —fog, and u = g{x), then: 



dh _dl dg 
dx du dx 



(8) 



Example 0.3 Consider the function h{x) = exp(x^), this can be decomposed as h{x) = f{g{x)) = f{u) = exp(M), 
where u = g{x) = x^ and has derivative = ■ |^ = exp(M) ■ 2x = exp(x^) ■ 2x 

Exercise 0.11 Consider the function f{x) = x^ and its derivative ^. Look at the derivative of that function at points [- 
2,0,2 ], draw the tangent to the graph in that point ^ ( — 2) = —4, g| (0) =0, and ^ (2) = 4. For example, the tangent 
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equation for x = —2 is y = — 4x — b, where b = /(— 2). The following code plots the function and the derivatives on 
those points using matplotlib (See Figure^. 



a = np . arange (-5, 5, 0 . 01 ) 
f_x = np .power (a, 2 ) 
pit .plot (a, f_x) 

pit . xlim (-5, 5) 
plt.ylim (-5,15) 

k= np . array ( [-2, 0,2]) 
pit. plot (k,k**2, "bo") 
for i k : 

pit. plot (a, (2*i)*a - (i**2) ) 
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10 



5 



0 



-4 -2 0 2 4 

Figure 4: Illustration of the gradient of the function /(x^) at three different points x = [—2,0.2]. Note that at 
point X = 0 the gradient is zero which corresponds to the minimum of the function. 



The gradient of a function is a generalization of the derivative concept we just saw before for several 
dimensions. Lets assume we have a function f[x) where x G M^, so x can be seen as a pair x — X\, Xi- Then, 
the gradient measures the slope of the fimction in both directions: V xf{^) = ^] ■ 

0.4.3 Gradient Based Methods 

Gradient based methods are probably the most common methods used for finding the minimizing sequence 
for a given function. The methods used in this class will make use of the function value f{x) as well as the 
gradient of the function Vi-/(x). The simplest method is the Gradient descent method, an unconstrained 
first-order optimization algorithm. 

The intuition of this method is as follows: You start at a given point xq ^rid compute the gradient at that 
point V.v,,/(x). You then take a step of length r] on the direction of the negative gradient to find a new point: 
X\ = xq — rjVxcfix). Then, you compute the gradient at this new point, Vjj/(x), and take a step of length rj 
on the direction of the negative gradient to find a new point: X2 = Xi — 7/Vxi/(x). You proceed until you have 
reached a minimum (local or global). Recall from the previous subsection that you can identify the minimum 
by testing if the norm of the gradient is zero: 1 1 V/(x) 1 1 = 0. 

There are several practical concerns even with this basic algorithm to ensure both that the algorithm con- 
verges (reaches the minimum) and that it does so in a fast way (by fast we mean the number of function and 
gradient evaluations). 
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Step Size rj A first question is how to find the step length rj. One condition is that eta should guarantee 
sufficient decrease in the function value. We will not cover these methods here but the most common 
ones are Backtracking line search or the Wolf Line Search ( [Nocedal and Wright 1999| . 



Descent Direction A second problem is that using the negative gradient as direction can lead to a very 
slow convergence. Different methods that change the descent direction by multiplying the gradient by 
a matrix /5 have been proposed that guarantee a faster convergence. Two notable methods are the Con- 



jugate Gradient (CG) and the Limited Memory Quasi Newton methods (LBFGS) (Nocedal and Wright 

• Stopping Criteria Finally, it will normally not be possible to reach full convergence either because it will 
be too slow, or because of numerical issues (computers cannot perform exact arithmetic). So normally 
we need to define a stopping criteria for the algorithm. Three common criteria (that are normally used 
together) are: a maximum number of iterations; the gradient norm be smaller than a given threshold 
||V/(x)|| < t]i, or the normalized difference in the function value be smaller than a given threshold 
\fM-f{-,-i)\ < „^ 

max(|/(x,)U/(.r,_i)l) " 

Algorithm [T] shows the general gradient based algorithm. Note that for the simple gradient descent al- 
gorithm /5 is the identity matrix and the descent direction is just the negative gradient of the fimction, fi = 
- V/(x). Figure [s] shows an illustration of the gradient descent algorithm. 



Algorithm 1 Gradient Descent 
1: given a starting point xq, i = 0 
2: repeat 

3: Compute step size r] 

4: Compute descent direction — /5V/(x,). 

5: <— Xi — t]fi\7f{Xi) 

6: i + 1 

7: until stopping criterion is satisfied. 



Exercise 0.12 Consider the function f{x) = (x -|- 2)^ — 16exp ( — (x — 2)^). Make a function that computes the 
function value given x. 

def get_y (x) : 

return (x+2)**2 - 1 6 *np . exp (- ( (x-2 ) * *2 ) ) 



Draw a plot around x e [—8,8]. 



X = 


np . arange (-8, 


3, 0.001) 


y = 


map ( lambda u : 


get_y(u) , x) 


pit 


plot (x,y) 




pit 


show ( ) 





Calculate the derivative of the function f{x), implement the function get_grad(x). 



def get_grad(x} : 

return (2 *x+4 ) -1 6 * (-2 *x + 4 ) *np . exp (- ( (x-2 ) **2 ) ) 



Use the method gradient_descent to find the minimum of this function. Convince yourself that the code is doing the 
proper thing. Look at the constants we defined. Note, that we are using a simple approach to pick the step size (always 
have the value stepjize) which is not necessarily correct. 

def grad±ent_descent (start_x, func, grad) : 
# Precision of the solution 
prec = 0. 0001 

iUse a fixed small step size 
step_size = 0.1 
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Figure 5: Illustration of gradient descent. The blue circles correspond to contours of the function (each blue 
circle is a set of points which have the same function value), while the red lines correspond to steps taken in 
the negative gradient direction. 



imax iterations 
max_iter = 100 
x_new = start_x 
res = [] 

for i in xrange (max_iter) : 
x_old = x_new 

#Use beta equal to -1 for gradient descent 

x_new = x_old - step_size * get_grad (x_new) 

f_x_new = get_y (x_new) 

f_x_old = get_y (x_old) 

res . append ( [x_new, f_x_new] ) 

if (abs (f_x_new - f_x_old) < prec) : 



print "change in function values too small, leaving" 
return np . array (res) 



print "exceeded maximum number of iterations, leaving" 
return np . array (res ) 



Run the gradient descent algorithm starting from xq = —8 and plot the minimizing sequence. 



x_0 = -8 

res = gradient_descent (x_0, get_y, get_grad) 
pit .plot (res [ : , 0] , res [ : , 1] , '+' ) 
pit . show ( ) 



Figure^shozvs the resulting minimizing sequence. Note that the algorithm converged to a minimum, but since the 

function is not convex it converged only to a local minimum. 

Now try the same exercise starting from the initial point xq = 8. 
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Figure 6: Example of running gradient descent starting on point xq = —8 for function f{x) = (x + 2)^ — 
16 exp ( — (x — 2) ) . The function is represented in blue, while the points of the minimizing sequence are dis- 
played as green plus signs. 



x_0 = 8 

res = gradient_descent (x_0, get_y, get_grad) 
plot (res [ : , 0] , res [ : , 1] , '+') 



FigureY\shows the resulting minimizing sequence. Note that now the algorithm converged to the global minimum. 
However, note that to get to the global minimum the sequence of points jumped from one side of the minimum to the other. 
This is a consequence of using a wrong step size (in this case too large). Repeat the previous exercise changing both the 
values of the step-size and the precision. What do you observe? 

During this school we will rely on the numerical optimization methods provided by Scipy (scientific com- 
puting library in python), which are very efficient implementations. 

0.5 Python Exercises 
0.5.1 Numpy and Matplotlib 

Exercise 0.13 1. Consider the function f {x) = (x -|- 2)^ — 16exp (— (x — 2)^). Draw a plot around the x G 
[—8,8] region. 

2. What is ^? 
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Figure 7: Example of running gradient descent starting on point xq = 8 for function f{x) = (x + 2)^ — 
16 exp ( — (x — 2) ) . The function is represented in blue, while the points of the minimizing sequence are dis- 
played as green plus signs. 



3. Use gradient descent to find a local minimum starting from xq = —4 and xq = +4, with t] = .01. Plot all of the 
intermediate estimates that you obtain in the same plot. 

import numpy as np 

import matplotlib .pyplot as pit 

X = np. linspace (-8, 8, 1000) 

Y = (X+2)**2 - 16*np.exp (- ( (X-2) **2) ) 

# derivative of the function f 
def get_Y_dev ( X ) : 

return (2 *x+4 ) -1 6 * (-2 *x + 4 ) *np . exp (- ( (x-2 ) **2 ) ) 

def grad_desc (start_x, eps, prec) : 

runs the gradient descent algorithm and returns the list of estimates 

example of use grad_desc (X, 0.01, 0.00001) 
f I 1 

x_new = start_x 

x_old = start_x + prec * 2 

res = [x_new] 

while abs (x_old-x_new) > prec: 
x_old = x_new 

x_new = x_old - eps * get_Y_dev (x_new) 
res . append (x_new) 
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return np . array ( res ) 
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Over the next couple of exercises we will make use of the Galton dataset, a dataset of heights of fathers and 
sons from the 1877 paper that first discussed the "regression to the mean" phenomenon. 

Exercise 0.14 • Use the load () function in the galton. py file to load the dataset. 

• What are the mean height and standard deviation of all the people in the sample! What is the mean height of the 
fathers and of the sons? 

• Plot a histogram of all the heights (you might want to use the pit . hist function and the ravel method on 
arrays). 

• Plot the height of the father versus the height of the son. 

• You should notice that there are several points that are exactly the same (e.g., there are 21 pairs with the values 68.5 
and 70.2). Use the ? command in ipython to read the documentation for the numpy. random . rand function 
and add random jitter (i.e., move the point a little bit) to the points before displaying them. Does your impression 
of the data change! 

Exercise 0.15 Consider the linear regression problem (ordinary least squares), with a single response variable 

y = x^w + e 

The linear regression problem is, given a set {y^'^ji of samples ofy and the corresponding x^''^ vectors, estimate w 
to minimise the sum of the e variables. Traditionally this is solved analytically to obtain a closed form solution (although 
this is not the way in which it should he computed, linear algebra packages have an optimised solver, with numpy, 
use numpy . linalg . Istsq). 

Alternatively, we can define the error function for each possible w: 

1. Derive the gradient of the error J^. 

2. Implement a solver based on this for two dimensional problems (i.e., w e R^). 

3. Use this method on the Galton dataset from the previous exercise to estimate the relationship between father and 
son's height. Try two formulas 

s = fwi + e, (9) 
where s is the son's height, and f is the father heights; and 

s = fwi + IWq + £ (10) 

where the input variable is now two dimensional (/, 1). This allows the intercept to be non-zero. 

4. Plot the regression line you obtain with the points from the previous exercise. 

5. Use the np . linalg . Istsq function and compare to your solution. 

0.5.2 Debugging 

Exercise 0.16 Use the debugger to debug the buggy .py script which attempts to repeatedly perform the following 
computation: 

1. Start xq = 0 

2. Iterate 

(a) Xj^j = xt + r, where r is a random variable. 

(b) ifx^_^^ >= 1., then stop. 
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(c) ifXf_^_^ <— 0., then Xt+i — 0 

(d) else Xt+i — ^f+i- 

3. Return the number of iterations. 

Having repeated this computation a number of times, the programme prints the average. Unfortunately, the program 
has a few bugs, which you need to fix. 
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Dayl 

Classification 



This day will serve as an introduction to machine learning. We recall some fundamental concepts about deci- 
sion theory and classification. We also present some widely used models and algorithms and try to provide 
the main motivation behind them. There are several textbooks that provide a thorough description of some 
of the concepts introduced here: for example, Mitchell ( |1997[ |, Duda et al. ( |2001^ , Scholkopf and Smola (j2002|, 
Joachims ( |2002[ |, [Bishop (2 006| , [Manning et"ar ( |2008| , to name just a few. The concepts that we introduce in this 



chapter will be revisited in later chapters, where the same algorithms and models will be adapted to structured 
inputs and outputs. For now, we concern only with multi-class classification (with just a few classes). 



Today's assignment 

The assignment of today's class is to implement a classifier called Naive Bayes, and use it to perform sentiment 
analysis on a corpus of book reviews from Amazon. 



1.1 Notation 

In what follows, we denote by X our input set (also called observation set) and by y our output set. We will make 
no assumptions about the set X, which can be continuous or discrete. In this lecture, we consider classification 
problems, where ^ = {ci, . . . , Cj^} is a finite set, consisting of K classes (also called labels). For example, X can be 
a set of documents in natural language, and y a set of topics, the goal being to assign a topic to each document. 

We use upper-case letters for denoting random variables, and lower-case letters for value assignments to 
those variables: for example, 

• X is a random variable taking values on X, 

• Y is a random variable taking values on 

• X e X and y £ y are particular values for X and Y. 

We consider events such as X = x, Y = y, etc. 

For simplicity reasons, throughout this lecture we will use modified notation and let P(y) denote the prob- 
ability associated with the event Y = y (instead of -Py(Y = y)). Also, joint and conditional probabilities are 
denoted as P{x,y) = Px,y{X = x A Y = y) and -P(x|y) = Px|y(-^ ~ ^ \ ^ ~ V)' respectively. From the laws of 
probabilities: 

P(x,y) = P(y|x)P(x) = P(x|y)P(y), (1.1) 

for all X e X and y G y. 

Quantities that are predicted or estimated from the data will be appended a hat-symbol: for example, 
estimations of the probabilities above are denoted as -P(y), -P(x,y) and P(y|x); and a prediction of an output 
will be denoted y. 

We assume that a training dataset D is provided which consists of M input-output pairs (called examples or 
instances): 

2) = {(xi,yi),...,(x*^,y*^)} CXxy. (1.2) 

The goal of (supervised) machine learning is to use the training dataset D to learn a function h (called a 
classifier) that maps from X to y: this way, given a new instance x £ X (test example), the machine makes a 
prediction y by evaluating h on x, i.e., y = h{x). 
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1.2 Generative Classifiers: NaiVe Bayes 

If we knew the true distribution P{X, Y), the best possible classifier (called Bayes optimal) would be one which 
predicts according to 



argmaxP(y|x) 



arg max 



P{x,y) 



yey P{x) 
argmaxPfx, v) 

argmaxP(y)P(x|y), 



(1.3) 



where in t we used the fact that P{x) is constant with respect to y. 

Generative classifiers try to estimate the probability distributions P{Y) and P{X\Y), which are respectively 
called the class prior and the class conditionals. They assume that the data are generated according to the follow- 
ing generative story (independently for each m = 1, . . . , M): 

1. A class ym ^ P{Y) is drawn from the class prior distribution; 

2. An input x,„ ^ P{X\Y = y„i) is drawn from the corresponding class conditional. 

Figure 1.1 shows an example of the Bayes optimal decision boimdary for a toy example with K = 2 classes, 
M = 100 points, class priors P{yi) = Piyi) = 0.5, and class conditionals P(x|i/,) given by 2-D Gaussian 
distributions with the same variance but different means. 



Simple 
2.5 

2.0 

1.5 

1.0 

0.5 

0.0 

-0.5 

-1.0 

-1.5 

-2.0 



Data Set-- Meanl= (-1.00,-1.00) Varl = 0.50 Mean2= (1,00,1.00) Var2= 0.50 
Nr. Points=100.00, Balarce=0.50 Train-Dev-Test (0.80,-0.00,0.20) 



Bayes Optimal 



> ■ ■ ■ ■ 



- ■ 



-3 



Figure 1.1: Example of a dataset together with the corresponding Bayes optimal decision boundary. The input 
set consists in points in the real plane, X = and the output set consists of two classes (Red and Blue). 
Training points are represented as squares, while test points are represented as circles. 
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1.2.1 Training and Inference 

Training a generative model amounts to estimating the probabilities P{Y) and P{X\Y) using the dataset D, 
yielding estimates P{i/) and P{x\y). This estimation is usually called training or learning. 

After we are done training, we are given a new input x G X, and we want to make a prediction according 

to 

y = argmaxP(y)P(x|y), (1.4) 

using the estimated probabilities. This is usually called inference or decoding. 
We are left with two important problems: 

1. How should the distributions P{Y) and P(X| Y) be "defined"? (i.e., what kind of independence assump- 
tions should they state, or how should they factor?) 

2. How should parameters be estimated from the training data D? 

The first problem strongly depends on the application at hand. Quite often, there is a natural decomposition 
of the input variable X into / components, 

X=(Xi Xj). (1.5) 

The naive Bayes method makes the following assumption: X^, . . .,Xj are conditionally independent given the 
class. Mathematically, this means that 

/ 

P{X\y) = n^(^;l^)- (1-6) 

M 

Note that this independence assumption greatly reduces the number of parameters to be estimated (degrees of 
freedom) from 0(exp(/)) to 0(/), hence estimation of P{X\Y) becomes much simpler, as we shall see. It also 
makes the overall computation much more efficient for large / and it decreases the risk of overfitting the data. 
On the other hand, if the assumption is over-simplistic it may increase the risk of under-fitting. 

For the second problem, one of the simplest ways to solve it is using maximum likelihood estimation, which 
aims to maximize the probability of the training sample, assuming that each point was generated indepen- 
dently. This probability (call it P(2))) factorizes as 

M 

P(D) = nP(x™,y'") 
m=l 
M 7 

= IlPiynY\H^T\yn- (1-7) 

m=l i=l 

1.2.2 Example: Multinomial Naive Bayes for Document Classification 

We now consider a more realistic scenario where the naive Bayes classifier may be applied. Suppose that the 
task is document classification: X is the set of all possible documents, and y = {yi, . . . , i/j^} is a set of classes for 
those documents. Let V = {wi, . . . ,wj}he the vocabulary, i.e., the set of words that occur in some document. 

A very popular document representation is through a "bag-of-words": each document is seen as a collec- 
tion of words along with their frequencies; word ordering is ignored. We are going to see that this is equivalent 
to a naive Bayes assumption with the multinomial model. We associate to each class a multinomial distribution, 
which ignores word ordering, but takes into consideration the frequency with which each word appears in a 
document. For simplicity, we assume that all documents have the same length ij^ Each document x is as- 
sumed to have been generated as follows. First, a class y is generated according io P{y). Then, x is generated 
by sequentially picking words from V with replacement. Each word Wj is picked with probability P{iVj\y). For 
example, the probability of generating a document x = Wj^. . . Wj^ {i.e., a sequence of L words Wj^,..., WjJ is 

1=1 j=i 



^We can get rid of this assumption by defining a distribution on the document length. Everytfung stays the same if that distribution is 
uniform up to a maximum document length. 
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where nj{x) is the number of occurrences of word zvj in document x. 

Hence, the assumption is that word occurrences (tokens) are independent given the class. The parameters 
that need to be estimated are P{yi ), ■ ■ ■ ,P{yK)> arid P{iVj\y]f) for ] = !,...,] and k = 1, . . . ,K. Given a trarnrng 
sample f = { (x^, ), . . . , [x^, y^)}, denote by the indices of those instances belonging to the kth class. The 
maximum likelihood estimates of the quantities above are: 

In words: the class priors' estimates are their relative frequencies (as before), and the class-conditional word 
probabilities are the relative frequencies of those words across documents with that class. 



1.3 Assignment 

With the previous theoretical background, you will be able to solve today's assignment. 



Exercise 1.1 In this exercise we will use the Amazon sentiment analysis data ^Blitzer etoL 2007 \ where the goal is to 



classify text documents as expressing a positive or negative sentiment (i.e., a classification problem with two classes). 
We are going to focus on hook reviews. To load the data, type: 



import Ixmls . readers . sentiment_reader as srs 
import Ixmls . classifiers . naive_bayes nb 

scr = srs . Sent imentCorpus ( "books " ) 



This will load the data in a bag-of-words representation where rare words (occurring less than 5 times in the training 
data) are removed. 

1. Implement the Naive Bayes algorithm. Open the file multinomial_naive_bayes .py, which is inside the 
classifiers folder. In the KultinomialNaiveBayes class you will find the train method. We have 
already placed some code in that file to help you get started. 

2. After implementing, run Naive Bayes with the multinomial model on the Amazon dataset (sentiment classification) 
and report results both for training and testing: 

import Ixmls . classifiers .multinomial_naive_bayes mnbb 
mnb = mnbb .Mult inomialNaiveBayes 0 

params_nb_sc = mnb . train ( scr . train_X, scr . train_y) 
y_pred_train = mnb . test (scr . train_X, params_nb_sc) 
acc_train = mnb . evaluate ( scr . train_y, y_pred_train) 
y_pred_test = mnb . test (scr . test_X, params_nb_sc) 
acc_test = mnb . evaluate (scr . test_y, y_pred_test) 

print "Multinomial Naive Bayes Amazon Sentiment Accuracy train: %f test: %f"%( 
acc_train, acc_test) 



3. Observe that words that were not observed at training time cause problems at test time. Why? To solve this 



problem, apply a simple add-one smoothing technique: replace the expression in Eq. 1.9 for the estimation of the 
conditional probabilities by 

PlWjlC].) = — . 

' J + LLLme3,n,(x'n) 

where J is the number of distinct words. 

This is a widely used smoothing strategy which has a Bayesian interpretation: it corresponds to choosing a uniform 
prior for the word distribution on both classes, and to replace the maximum likelihood criterion by a maximum a 
posteriori approach. This is a form of regularization, preventing the model from overfitting on the training data. 
See e.g. Manning and SchUtze ^999^; Manning etal. ( 2008 ' for more information. Report the new accuracies. 
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1.4 Discriminative Classifiers 



In the previous sections we discussed generative classifiers, which require us to model the class prior and class 
conditional distributions (P(Y) and P(X|Y), respectively). Recall, however, that a classifier is any function 
which maps objects x G X onto classes J/ G While it's often useful to model how the data was generated, it's 
not required. Classifiers that do not model these distributions are called discriminative classifiers. 



1.4.1 Features 

For the purpose of understanding discriminative classifiers, it is useful to think about each x e X as an abstract 
object which is subject to a set of descriptions or measurements, which are called features. A feature is simply a 
real number that describes the value of some property of x. For example, in the previous section, the features 
of a document were the number of times each word Wj appeared in it. 
Let gi{x),...,gj{x)he } features of x. We call the vector 

g{x) = {gi{x) gj{x)) (1.10) 

a feature vector representation of x. The map g : X — > is called a feature mapping. 

In NLP applications, features are often binary-valued and result from evaluating propositions such as: 



^1 



. s A r 1/ if sentence x contains the word Ronaldo ^-.^ 
= 1 0, otherwise. ^^'^^^ 

. ^ A f 1, if all words in sentence x are capitalized ^ ^, 

= I 0, otherwise. ^^-^^^ 

, . A f 1, if X contains any of the words amazing, excellent or . ^, 
= jo, otherwise. ^^"^^^ 

In this example, the feature vector representation of the sentence "Ronaldo shoots and scores an amazing 
goal!" would be = (1,0,1). 

In multi-class learning problems, rather than associating features only with the input objects, it is useful 
to consider joint feature mappings / : X x y ^ IR^. In that case, the joint feature vector f{x, y) can be seen as a 
collection of joint input-output measurements. For example: 

^, \ A / 1' if ^ contains Kowfl/do, and topic y is sport 

Ji{x,y) - I 0, otherwise. ^ ' 

\ A / 1' if X contains RonaWo, and topic 1/ is politics /i icn 

f2{x,y) - otherwise. ^^'^^^ 

A very simple form of defining a joint feature mapping which is often employed is via: 

f{x,y) = g{x)®ey 

= (0,...,0, g(x),0,...,0) (1.16) 

yth slot 

where g{x) e IR^ is a input feature vector, 0 is the Kronecker product {[a ® h\ij — aibj) and Cy G M*-, with 
[cyjc = liiiy — c, and 0 otherwise. Hence f{x,y) e R^^^. 

1.4.2 Inference 

Linear classifiers are very popular in natural language processing applications. They make their decision based 
on the rule: 

y = ax^maxw- f{x,y). (1.17) 

where 

•we is a weight vector; 

• f{x,y) e R^ is a feature vector; 
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• w ■ f{x,y) = E£=i ^(ifdi^'V) is the inner product between w and f{x,y). 

Hence, each feature {x, y) has a weight 0;^ and, for each class y G y, a score is computed by linearly combin- 
ing all the weighted features. All these scores are compared, and a prediction is made by choosing the class 
with the largest score. 



Remark 1.1 With the design above (Eq. 1.16*, and decomposing the weight vector as w = {wc^, ... ,Wc^),'we have that 

^v- f{x,y) =ivyg{x). (1.18) 

In words: each class y e ^ gets its own weight vector Wy, and one defines a input feature vector g{x) that only looks at 
the input x e X. This representation is very useful when features only depend on input x since it allows a more compact 
representation. Note that the number of features is normally very large. 

Remark 1.2 The multinomial naive Bayes classifier described in the previous section is an instance of a linear classifier 
Recall that the naive Bayes classifier predicts according toy = argmax^gy P{y)P{x\y). Taking logs, in the multinomial 
model for document classification this is equivalent to: 

y = argmaxlogP(y) +logP(x|y) 

/ 

= argmaxlogP(y) + ^ n^(x) logP(w;^|y) 

= argmaxify ■ ^(x), (1.19) 



where 



{by,\o^P{wi\y),...,\o^P{wj\y)) 

logP(y) 



g{x) = (l,ni(x),...,nj(x)). (1.20) 
Hence, the multinomial model yields a prediction rule of the form 

y = argmaxwy ■ (1.21) 

1.4.3 Online Discriminative Algorithms 

We now discuss two discriminative classification algorithms. These two algorithms are called online (or stochas- 
tic) algorithms because they only process one data point (in our example, one document) at a time. Algorithms 
which look at the whole dataset at once are called offline, or batch algorithms, and will be discussed later. 

Perceptron 

The perceptron (jRosenblatt 1958| is perhaps the oldest algorithm used to train a linear classifier The perceptron 



works as follows: at each round, it takes an element x from the dataset, and uses the current model to make 
a prediction. If the prediction is correct, nothing happens. Otherwise, the model is corrected by adding the 
feature vector w.r.t. the correct output and subtracting the feature vector w.r.t. the predicted (wrong) output. 
Then, it proceeds to the next round. 

Alg.|2]shows the pseudo-code of the perceptron algorithm. As it can be seen, it is remarkably simple; yet it 
often reaches a very good performance, often better than the Naive Bayes, and usually not much worse than 
maximum entropy models or SVMs (which will be described in the following section). 

A weight vector w defines a separating hyperplane if it classifies all the training data correctly, i.e., if y™ — 
arg maXj^gy w ■ f{x"', y) hold for m = 1, . . . , M. A dataset D is separable if such a weight vector exists (in general, 
w is not unique). A very important property of the perceptron algorithm is the following: if D is separable, 
then the number of mistakes made by the perceptron algorithm until it finds a separating hyperplane is finite. 
This means that if the data are separable, the perceptron will eventually find a separating hyperplane w. 

There are other variants of the perceptron (e.g., with regularization) which we omit for brevity. 



^Actually, we are showing a more robust variant of the perceptron, which averages the weight vector as a post-processing step. 
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Algorithm 2 Averaged perceptron 



1: input: dataset D, number of rounds R 

2: initialize f = 0, ' = 0 

3: for r = 1 to do 

4: Vs = shuffle(D) 

5: for f = 1 to M do 

6: m = Vs{i) 

7: t = t + l 

8: take training pair (x™, y™ ) and predict using the current model: 

y <— argmaxif' ■ f{x"',y') 

9: update the model: zf' +/(x'",y'") - f{x"^,y) 

10: end for 
11: end for 

12: output: the averaged model w i — j YLi=i 



Exercise 1.2 V^e provide an implementation of the perceptron algorithm in theclass Perceptron (/i7e perceptron .py). 
1. Run the following commands to generate a simple dataset similar to the one plotted on Figure ^^ 
import Ixmls . readers . simple_data_set sds 

sd = sds . SimpleDataSet (nr_examples=100 , gl = [[-1,-1],1], g2 = [[1,1],1], balance=0. 
5, split = [0 .5, 0, 0 . 5] ) 



2. Run the perceptron algorithm on the simple dataset previously generated and report its train and test set accuracy: 



import Ixmls . class! fiers . perceptron percc 
perc = percc . Perceptron ( ) 

params_perc_sd = perc . train (sd. train_X, sd . train_y) 
y_pred_train = perc . test ( sd. train_X, params_perc_sd) 
acc_traln = perc . evaluate (sd. train_y, y_pred_train) 
y_pred_test = perc . test (sd . test_X, params_perc_sd) 
acc_test = perc . evaluate ( sd . test_y, y_pred_test) 

print "Perceptron Simple Dataset Accuracy train: %f test: %f"%(acc_train,acc_test) 



3. Plot the decision boundary found: 



fig, axis = sd . plot_data ( ) 

fig, axis = sd . add_line ( fig, axis , params_perc_sd, "Perceptron", "blue") 



Change the code to save the intermediate weight vectors, and plot them every five iterations. What do you observe? 
4. Run the perceptron algorithm on the Amazon dataset. 

Margin Infused Relaxed Algorithm (MIRA) 

The MIRA algorithm iCrammer and Singerl 2002 Crammer et al. 20061 has achieved very good performance 



in NLP problems. Recall that the Perceptron takes an input pattern and, if its prediction is wrong, adds the 
quantity [f{x"\ y™ ) - /(x™, y)] to the weight vector. MIRA changes this by adding r]^[f{x'", y"' ) - f{x"\ y)] to 
the weight vector. The difference is the step size t]'^, which depends on the iteration t. 

There is a theoretical basis for this algorithm, which we now briefly explain. At each round t, MIRA updates 
the weight vector by solving the following optimization problem: 
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Algorithm 3 MIRA 



1: input: dataset D, parameter A, number of rounds R 

2: initialize t = 0,w'^ = 0 

3: for r = 1 to K do 

4: Ds = shuffle(I') 

5: for f = 1 to M do 

6: nt = Vs{i) 

7: t = t + l 

8: take training pair (x™, ) and predict using the current model: 

y argmaxw' ■ f{x"',y') 

9: compute loss: = ■ f{x"\y) - w* ■ f{x'",y"') + p{y,y'") 
10: compute stepsize: = min { 

11: update the model: ^ + rj' {f{x'" ,y"') - /{x'" ,y)) 
12: end for 
13: end for 

14: output: the averaged model w i — j Yli=i 



w'+^^argmin ^+^\\w-w^\\'^ (1.22) 

s.t. «;-/(x"',y'«) > it;-/(x"',y) + l-^ (1.23) 

^ > 0, (1.24) 

where y = argmax^^/^y zf' ■ f{x"',y') is the prediction using the model with weight vector wK By inspecting 



Eq. 1.22 we see that MIRA attempts to achieve a tradeoff between conservativeness (penalizing large changes 
from the previous weight vector via the term j\\w — w^\\^) and correctness (by requiring, through the con- 
straints, that the new model w^^^ "separates" the true output from the prediction with a margin (alth ough 



1.22 



slack ^ > 0 is allowed)]^ Note that, if the prediction is correct (y = y'") the solution of the problem Eq 
leaves the weight vector unchanged (w = w^). This quadratic programming problem has a closed form 
solutionis 

^t+l ^ ^t + ^t^^^^m^ym^ _ /(x'«,y)). 



f/' = min I A 



with 

-1 w'-f{x -^,y)-w'-f{x-^,y-')+p {y,y' 
||/(x'",y'«)-/(x-,y)||2 

where p : y x y ^ ]R+ is a non-negative cost function, such that p{y,y) is the cost incurred by predicting y 
when the true output is y; we assume p{y,y) = 0 for all y G y. For simplicity, we focus here on the 0/1-cost 
(but keep in mind that other cost functions are possible): 

P^^'V) = I I othetJise. (^-25) 



MIRA is depicted in Alg.|3] For other variants of MIRA, see Crammer et al. ( |2006) . 



Exercise 1.3 Implement the MIRA algorithm (Hint: use the perceptron algorithm as a starting point and modify it as 
necessary). Do this by creating a file Mir a .py and implement class Mir a. Then, repeat the perceptron exercise now 
using MIRA, for several values of\: 

import Ixmls . classifiers .mira : mirac 

mira = mirac . Mira ( ) 

mira . regular izer = 1.0 i This is lambda 
params_mira_sd = mira . train (sd. train_X, sd. train_y) 



^The intuition for this large margin separation is the same for support vector machines, which will be discussed in 1 1 .4.4 
*Note that the perceptron updates are identical, except that we always have ijt = 1- 
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y_pred_train = mira . test ( sd. train_X, params_mira_sd) 
acc_train = mira . evaluate (sd . train_y, y_pred_train) 
y_pred_test = mira . test (sd . test_X, params_mira_sd) 
acc_test = mira . evaluate (sd . test_y, y_pred_test) 

print "Mira Simple Dataset Accuracy train: %f test: %f"%(acc_train,acc_test) 
fig, axis = sd. add_line (fig , axis , par ams_mira_sd, "Mira", "green" ) 

params_mira_sc = mira . train (scr . train_X, scr . train_y) 
y_pred_train = mira . test ( scr . train_X, params_mira_sc) 
acc_train = mira . evaluate (scr . train_y , y_pred_train) 
y_pred_test = mira . test ( scr . test_X, params_mira_sc) 
acc_test = mira . evaluate (scr . test_y, y_pred_test) 

'^■•'■'nt "Mira Amazon Sentiment Accuracy train: %f test: %f"%(acc_train,acc_test) 



Compare the results achieved and separating hiperplanes found. 
1.4.4 Batch Discriminative Classifiers 

As we have mentioned, the perceptron and MIRA algorithms are called online or stochastic because they look 
at one data point at a time. We now describe two discriminative classifiers which look at all points at once; 
these are called offline or batch algorithms. 

Maximum Entropy Classifiers 



The notion of entropy in the context of Information Theory ( Shannon 1948) is one of the most significant ad 



vances in mathematics in the twentieth century. The principle of maximum entropy (which appears under 
different names, such as "maximum mutual information" or "minimum Kullbac k-Leibler divergence") plays 
a fundamental role in many methods in statistics and machine learning (Ja5mes 1982^ . |^ The basic rationale 



is that choosing the model with the highest entropy (subject to constraints that depend on the observed data) 
corresponds to making the fewest possible assumptions regarding what was unobserved, making imcertainty 
about the model as large as possible. 

For example, if we throw a die and want to estimate the probability of its outcomes, the distribution with 
the highest entropy would be the uniform distribution (each outcome having of probability a 1/6). Now 
suppose that we are only told that outcomes {1, 2, 3} occurred 10 times in total, and {4, 5, 6} occurred 30 times 
in total, then the principle of maximum entropy would lead us to estimate P(l) = ^^(2) = ^^(3) = 1/12 and 
P(l) = P(2) = P(3) = 1/4 (i.e., outcomes would be uniform within each of the two groups)]^ 

This example could be presented in a more formal way. Suppose that we want to use binary features to 
represent the outcome of the die throw. We use two features: /i23(x, y) = 1 if and only if y G {1,2,3}, and 
/456(x,y) = 1 if and only if y e {4,5,6}. Our observations state that in 40 throws, we observed /123 10 times 
(25%) and /456 30 times (75%). The maximum entropy principle states that we want to find the parameters 
w of our model, and consequently the probability distribution Pa;(y|X), which makes /123 have an expected 
value of 0.25 and /jsg have an expected value of 0.75. These constraints, E[/i23] = 0.25 and E[/456] = 0.75, are 
known as first moment matching constraints^ 

An important fundamental result, which we will not prove here, is that the maximum entropy distribution 
Pjy (Y|X) under first moment matching constraints is a log-linear modeZ. |^ It has the following parametric form: 



The denominator in Eq. 1.26 is called the partition function: 



Z{zv,x) = ^ exY>{w ■ f{x,y')). (1.27) 



Cover et al. 



1991 



^For an excellent textbook on Information Theory, we recommend 
^For an introduction of maximum entropy models, along with pointers to ttie literature, see http : / /www. cs . emu . edu/ -aberger/ 



[maxent . ht ml 

'In general, these constraints mean that feature expectations under that distribution Ylm Ey~Pm [/(^m/ ^)] must match the observed 
relative frequencies ^ E„, f{x„,,y,„ ). 

*Also called a a Boltzmarm distribution, or an exponential family of distributions. 
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An important property of the partition function is that the gradient of its logarithm equals the feature expec- 
tations: 



Vit,logZ(zt7,x) = E^[f{x,Y)] 

= ^ P„(y'|x)/(x,y')- 



The average conditional log-likelihood is: 



— logP,„(y\. 



M 



-iogn^'^.(y'"k'") 



M 
1 

M 
1 

M 



m=l 

M 

Y: logP..(y'"|x'") 

m=l 
M 

Y: {wf{x'-,y"')-\o^Z{w,x"')). 

m=l 



(1.28) 



(1.29) 



We try to find the parameters w that maximize the log-likelihood £i{w;T>); to avoid overfitting, we add a 
regularization term that penalizes values of w that have a high magnitude. The optimization problem becomes: 



w 



argmax£(i{;;D) — TrUifH^ 

to ^ 



arg min —Ziw; T)) 



A, 



\w\ 



(1.30) 



Here we use the squared L2-norm as the regularizer|^but other norms are possible. The scalar A > 0 controls 
the amount of regularization. Unlike the naive Bayes examples, this optimization problem does not have 
a closed form solution in general; hence we need to resort to numerical optimization (see section 0.4 1. Let 
^xiw, CD) = —Z{w; T)) + ^ W'^W^ be the objective function in Eq. 
that a local optimum of Eq. 
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1.30 



This function is convex, which implies 



is also a global optimum. Fx{w; D) is also differentiable: its gradient is 



V„,Fa(u;;2)) 



1 



M 



M 



^ (-/(x'",y'") + V^logZ(u.,x'")) + An; 

m=l 
I M 

j:(-/(x-,y'")+E^[/(x™,Y)])+Az<;. 



M 



(1.31) 



m=l 



A batch gradient method to optimize Eq. 1.30 is shown in Alg. |4] Essentially, Alg. |4] iterates through the 
following updates until convergence: 



w 



f+i 



^ zt;'-/7tV^FA(zt;';D) 



M 



{\-M^)w' + nt^ E (/(x'",y'«)-E..[/(x'«,Y)]). 



(1.32) 



m=l 



Convergence is ensured for suitable stepsizes t/j. Monotonic decrease of the objective value can also be ensured 
if r]t is chosen with a suitable line search method, such as Armijo's rule ( Nocedal and Wright [1999 j |. In practice, 
more sophisticated methods exist for optimizing Eq. [1.30} such as conjugate gradient or L-BFGS. The latter is 
an example of a quasi-Newton method, which only requires gradient information, but uses past gradients to 
try to construct second order (Hessian) approximations. 

In large-scale problems (very large M) batch methods are slow. Online or stochastic optimization are at- 
tractive alternative methods. Stochastic gradient methods make "noisy" gradient updates by considering only 
a single instance at the time. The resulting algorithm, called Stochastic Gradient Descent (SGD) is shown as 
Alg.|5] At each round t, an instance m{t) is chosen, either randomly (stochastic variant) or by cycling through 
the dataset (online variant). The stepsize sequence must decrease with t: typically, rjt = '70^ * for some //o > 0 



'in a Bayesian perspective, this corresponds to choosing independent Gaussian priors p(ri'rf) 
weight vector. 



!N(0; 1/A^) for each dimension of the 
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Algorithm 4 Batch Gradient Descent for Maximum Entropy 



1: input: D, A, number of rounds T, 
learning rate sequence {f]t)t=l,...,T 
2: initialize ^ = 0 
3: for t = 1 to T do 
4: for ;« = 1 to M do 

5: take training pair (x'",i/'") and compute conditional probabilities using the current model, for each 

p /„/|^m^ _ expjw* ■ f{x'",y')) 
"''^^1 ^" Z{w,x) 

6: compute the feature vector expectation: 

E^4f{^'^y)]= L Pw>{y'\nf{x'",y') 

7: end for 

8: choose the stepsize 7/f using, e.g., Armijo's rule 
9: update the model: 

M 
m=l 

10: end for 

11: output: w w^~^^ 



and a G [1, 2], tuned in a development partition or with cross-validation. 

Exercise 1.4 We provide an implementation of the L-BFGS algorithm for training maximum entropy models in the class 
MaxEnt Joatch, as well as an implementation of the SGD algorithm in the class MaxEnt_online. 

1. Train a maximum entropy model using L-BFGS on the Simple data set (try different values of \). Compare the 
results with the previous methods. Plot the decision boundary. 

import Ixmls . classifiers .max_ent_batch mebc 

me_lbfgs = mebc . MaxEnt_batch ( ) 
me_lbfgs . regular izer =1.0 

params_meb_sd = me_lbf gs . train (sd . train_X, sd . train_y) 
y_pred_train = me_lbfgs .test (sd.train_X, params_meb_sd) 
acc_train = me_lbf gs . evaluate (sd. train_y, y_pred_train) 
y_pred_test = me_lbf gs . test (sd . test_X, par ams_meb_sd) 
acc_test = me_lbf gs . evaluate ( sd . test_y, y_pred_test) 

print "Max-Ent batch Simple Dataset Accuracy train: %f test: %f"%(acc_train,acc_test 
) 

fig, axis = sd . add_line ( fig, axis , params_meb_sd, "Max-Ent-Batch" , "orange") 



2. Train a maximum entropy model using L-BFGS, on the Amazon dataset (try different values of Aj and report 
training and test set accuracy. What do you observe? 

params_meb_sc = me_lbfgs . train (scr . train_X, scr . train_y) 
y_pred_train = me_lbfgs . test (scr . train_X, params_meb_sc) 
acc_train = me_lbfgs . evaluate (scr . train_y, y_pred_train ) 
y_pred_test = me_lbfgs . test (scr . test_X, params_meb_sc) 
acc_test = me_lbf gs . evaluate ( scr . test_y, y_pred_test) 

print "Max-Ent Batch Amazon Sentiment Accuracy train: %f test: %f"% (acc_train, 
acc_test ) 
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Algorithm 5 SGD for Maximum Entropy 



1: input: D, A, number of rounds T, 
learning rate sequence {f]t)t=l,...,T 
2: initialize ^ = 0 
3; for f = 1 to T do 
4: choose m = m{t) randomly 

5: take training pair (x™, y'") and compute conditional probabilities using the current model, for each y' G 
■y- 

p ^,/|^m^ _ e^w^ ■ f{x"',y')) 
^"'^^1 >~ Z(w,x) 

6: compute the feature vector expectation: 
7: update the model: 

^ (1 _ xtjt)w* + (/(x'«,y"') - E^[f{x^,Y)]) 

8: end for 

9: output: tl' <— 



3. Now, fix A = 1.0 fraf« zfzY^ SGD (yow might try to adjust the initial step). Compare the objective values 
obtained during training with those obtained with L-BFGS. What do you observe? 

import Ixmls . classifiers .max_ent_online ■ meoc 

me_sgd = meoc . MaxEnt_online ( ) 
me_sgd. regularizer =1.0 

params_meo_sc = me_sgd . train (scr . train_X, scr . traln_y) 
y_pred_train = me_sgd. test (scr . train_X, params_meo_sc) 
acc_train = me_sgd . evaluate (scr . train_y, y_pred_train) 
y_pred_test = me_sgd. test (scr .test_X,params_meo_sc) 
acc_test = me_sgd . evaluate (scr . test_y, y_pred_test) 

prj.. "Max-Ent Online Amazon Sentiment Accuracy train: %f test: %f"% (acc_train, 
acc_test ) 



Support Vector Machines 



Support vector machines are also a discriminative approach, but they are not a probabilistic model at all. The 
basic idea is that, if the goal is to accurately predict outputs (according to some cost function), we should focus 
on that goal in the first place, rather than trying to estimate a probability distribution (P(Y|X) or P{X,Y)), 
which is a more difficult problem. As Vapnik ( 1995 ) puts it, "do not solve an estimation problem of interest by 
solving a more general (harder) problem as an intermediate step." 

We next describe the primal problem associated with multi-class support vector machines (Crammer and 



Singer 2002 1, which is of primary interest in natural language processing. There is a significant amount of liter- 
ature about Kernel Methods ( [Scholkopf and Smola 2002 Shawe-Taylor and Cristianini 2004 1 mostly focused 
on the dual formulation. We will not discuss non-linear kernels or this dual formulation herein 

Consider p{y',y) as a non-negative cost function, representing the cost of assigning a label y' when the 



correct label was y. For simplicity, we focus here on the 0/1-cost defined by Equation 1.25 (but keep in mind 
that other cost functions are possible). The hinge Zos^^is the function 



e{w;x,y) = max [w ■ f{x,y') - w ■ f{x,y)+p{y',y)\ 

y'ey 



(1.33) 



^"The main reason why we prefer to discuss the primal formulation with linear kernels is that the resulting algorithms run in linear time 
(or less), while known kernel-based methods are quadratic with respect to M. In large-scale problems (large M) the former are thus more 
appealing. 

^^The hinge loss for the 0/1 cost is sometimes defined as i{w, x,y) = max{0,maxy/-^y w ■ f{x,y') — w ■ f{x,y) + 1}. Given our definition 
of p{y,y), note that the two definitons are equivalent. 
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Algorithm 6 Stochastic Subgradient Descent for SVMs 
1: input: D, A, number of rounds T, 
learning rate sequence {f]t)t=l,...,T 
2: initialize ^ = 0 
3; for f = 1 to T do 
4: choose m = m{t) randomly 

5: take training pair (x'",y"') and compute the "cost-augmented prediction" under the current model: 

y = argmaxi^^ • f{x"\y') - ■ /(x^y™) + p{y' ,y) 

6: update the model: 

^ (1 _ Xf^t)w* + i^t (/(x'«,y"') - f{x^,y)) 

7: end for 

8: output: IV <— If 



Note that the objective of Eg. [1.33 becomes zero when y' = y. Hence, we always have l{w, x, y) > 0. Moreover, 
if p is the 0/1 cost, we have £{w,x,y) =Oif and only if the weight vector is such that the model makes a correct 
prediction with a margin greater than 1: i.e., w ■ f{x,y) > w ■ f{x,y') + 1 for all y' ^ y. Otherwise, a positive 
loss is incurred. The idea behind this formulation is that not only do we want to make a correct prediction, but 
we want to make a confident prediction; this is why we have a loss unless the prediction is correct with some 
margin. 

Support vector machines (SVM) tackle the following optimization problem: 

M 

w = argmin ^ £(w;x™,y™) + -[lifll^ (1.34) 



where we also use the squared L2-norm as the regularizer. For the 0/1-cost, the problem in Eq. 1.34 is equiva- 
lent to: 

argmin Y.f,=iU + h\\M? (1-35) 

s.t. w/(x'",y'") >z(;-/(x'«,y™)+l-^m, Vm,y™ e ^ \ {y""}. (1.36) 

Geometrically, we are trying to choose the linear classifier that yields the largest possible separation margin, 
while we allow some violations, penalizing the amount of slack via extra variables ^i, . . . , ^m- There is now a 
trade-off: increasing the slack variables makes it easier to satisfy the constraints, but it will also increase the 
value of the cost function. 



Problem 1.34 does not have a closed form solution. Moreover, unlike maximum entropy models, the ob- 



jective function in 1.34 is non-differentiable, hence smooth optimization is not possible. However, it is still 
convex, which ensures that any local optimum is the global optimum. Despite not being differentiable, we can 
still define a subgradient of the objective function (which generalizes the concept of gradient), which enables us 



to apply subgradient-based methods. A stochastic subgradient algorithm for solving Eq. 1.34 is illustrated as 
Alg.|6] The similarity with maximum entropy models (Alg.|5} is striking: the only difference is that, instead of 
computing the feature vector expectation using the current model, we compute the feature vector associated 
with the cost-augmented prediction using the current model. 



A variant of this algorithm was proposed by |Shalev-Shwartz et al. ( |2007 ft under the name Pegasos, with ex 



cellent properties in large-scale settings. Other algorithms and software packages for training SVMs that have 
become popular are SVMLight (http://svmlight.joachims.org) and LIBSVM ( http : / / www . csieT] 
[ntu . edu . tw/ - c jlin/libsvm/ 1, which allow non-linear kernels. These will generally be more suitable for 
smaller datasets, where high accuracy optimization can be obtained without much computational effort. 



Remark 1.3 Note the similarity between the stochastic (sub-)gradient algorithms (A/gs. |5|j6| and the online algorithms 
seen above (perceptron and MIRA). 

Exercise 1.5 Implement the SVM primal algorithm (Hint: look at the models implemented earlier, you should only need 
to change a few lines of code). Do this by creating a file SVM.py and implement class SVM. Then, repeat the MaxEnt 
exercise now using SVMs, for several values of \: 
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import Ixmls . classifiers . svm as svmc 



svm = svmc. SVM () 

svm. regularizer = 1.0 § This is lambda 
params_svm_sd = svm. train (sd . train_X, sd. train_y) 
y_pred_train = svm. test (sd . train_X, params_svm_sd) 
acc_train = svm. evaluate (sd. train_y, y_pred_train) 
y_pred_test = svm . test (sd . test_X, params_svm_sd) 
acc_test = svm . evaluate ( sd . test_y, y_pred_test) 
print "SVM Online Simple Dataset Accuracy train: %f test: 



fig, axis 



%f"% (acc_train, acc_test ) 
sd. add_line ( fig, axis , params_svm_sd, "SVM", "orange" ) 



params_svm_sc = svm. train (scr . train_X, scr . train_y) 
y_pred_train = svm. test (scr . train_X, params_svm_sc) 
acc_train = svm . evaluate (scr . train_y, y_pred_train) 
y_pred_test = svm . test (scr . test_X, params_svm_sc) 
acc_test = svm. evaluate ( scr . test_y, y_pred_test) 

p..- -,at "SVM Online Amazon Sentiment Accuracy train: %f test: %f"% (acc_train, acc_test) 



Compare the results achieved and separating hiperplanes found. 



1.5 Comparison 

Table [LTj provides a high-level comparison among the different algorithms discussed in this chapter. 





Naive Bayes 


Perceptron 


MIRA 


MaxEnt 


SVMs 


Generative /Discriminative 


G 


D 


D 


D 


D 


Performance if true model 


Bad 


Fair (may 


Good 


Good 


Good 


not in the hipothesis class 




not converge) 








Performance if features overlap 


Fair 


Good 


Good 


Good 


Good 


Training 


Closed Form 


Easy 


Easy 


Fair 


Fair 


Hyperparameters to tune 


1 (smoothing) 


0 


1 


1 


1 



Table 1.1: Comparison among different algorithms. 



Exercise 1.6 • Using the simple dataset run the different algorithms varying some characteristics of the data: like 
the number of points, variance (hence separability), class balance. Use function run.alLclassifiers in file lab- 
s/run_alLclassifiers.py which receives a dataset and plots all decisions boundaries and accuracies. What can you 
say about the methods when the amount of data increases? What about when the classes become too unbalanced? 

1.6 Final remarks 

Some implementations of the discussed algorithms are available on the Web: 



• SVMLight: http : // svmlight . joachims . org 



LIBSVM: |http : / / www . csie . ntu . edu . tw/ ~c jlin/libsvm/| 



Maximum Entropy: http : / /homepages . inf . ed . ac. uk./lzhanglO/maxent_toolkit . html 



MALLET: |http: //mallet . cs .umass . edu/| 
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Day 2 

Sequence Models 



In this class, we relax the assumption that the data points are independently and identically distributed (i.i.d.) 
by moving to a scenario of structured prediction, where the inputs are assumed to have temporal or spacial 
dependencies. We start by considering sequential models, which correspond to a chain structure: for instance, 
the words in a sentence. In this lecture, we will use part-of-speech tagging as our example task. 

We start by defining the notation for this lecture in Section [ZT] Afterwards, in section 2.2 we focus on 



the well known Hidden Markov Models and in Section 2.3 we describe how to estimate its parameters from 
labeled data. In Section '. 



2.4 



we explain the inference algorithms (Viterbi and Forward-Backward) for sequence 
models. These inference algorithms will be fundamental for the rest of this lecture, as well as for the next 
lecture on discriminative training of sequence models. In Section 2.6 we describe the task of Part-of-Speech 
tagging, and how the Hidden Markov Models are suitable for this task. Finally, in Section 2.7 we address 
unsupervised learning of Hidden Markov Models through the Expectation Maximization algorithm. 



Today's assignment 

The assignment of today's class is to implement one inference algorithm for Hidden Markov Models, used to 
find the most likely hidden state sequence given an observation sequence. 



2.1 Notation 

In what follows, we assume a finite set of observation labels, S := {wi, . . . ,ivj}, and a finite set of state labels, 
A := {cj, . . . , cx}. We denote by S*, A* the two infinite sets of sequences obtained by grouping the elements 
of each label set including repetitions and the empty string ^ Elements of S* and A* are strings of observations 
and strings of states, respectively. Throughout this class, we assume our input set is X = E*, and our output 
set is y = A*. In other words, our inputs are observation sequences, x = x\Xi . . . xjv, for some N e N, where 
each X, e E; given such an x, we seek the corresponding state sequence, y = i/ij/2 • ■ ■ J/N/ where each y, e A. 
We also consider two special states: the start symbol, which starts the sequence, and the stop symbol, which 
ends the sequence. 

Moreover, in this lecture we will assume two scenarios: 

1. Supervised learning. We will train models from a sample set of paired observation and state sequences, 
DL:={(xi,yi),...,(x^,y^)}CXxy. 

2. Unsupervised learning. We will train models from the set of observations only, := {x^, . . . , x^} C X. 
Our notation is summarized in Table IZTI 



2.2 Hidden Markov Models 

Hidden Markov Models (HMMs) are one of the most common sequence probabilistic models, and have been 
applied to a wide variety of tasks. HMMs are particular instances of directed probabilistic graphical models (or 
Bayesian networks) which have a chain topology. In a Bayesian network, every random variable is represented 

^More formally, we say Z* :— {e} U E U U . . . and A* := {e} U A U U . . . is the Kleene closure of each of the two sets above. 
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Notation 




training set (including labeled data) 




!■■ 1/ 11111l 1\ 

training set (unlabeled data only) 


M 


number of training examples 


X — Xi...Xf] 


observation sequence 




state sequence 


N 


length ot the sequence 


Xi 


observation at position t m the sequence, i E |1, . . . , A; j 


Vi 


state at position i m the sequence, i E |1, . . . , A; j 




1 L' L 

observation set 


/ 


number of distinct observation labels 


Wj 


particular observation, j E {1,. . . ,}} 


A 


state set 


K 


number of distinct state labels 


C/c 


particular state, k e { 1, . . . , X} 



Table 2.1: General notation used in this class 



as a node in a graph, and the edges in the graph are directed and represent probabilistic dependencies between 
the random variables. For an HMM, the random variables are divided into two sets, the observed variables, 
X = Xj . . . X]v, and the hidden variables Y = . . . Yjv. In the HMM terminology, the observed variables are 
called observations, and the hidden variables are called states. The states are generated according to a first order 
Markov process, in which the i*^ state Y, depends only of the previous state Y,_x- Two special states are the 
start symbol, which starts the sequence, and the stop symbol, which ends the sequence. In addition, states 
emit observation symbols. In an HMM, it is assumed that all observations are independent given the states 
that generated them. 

As you may find out with today's assignment, implementing the inference routines of the HMM can be 
challenging. We start with a small and very simple (also very unrealistic!) example. The idea is that you may 
compute the desired quantities by hand and check if your implementation yields the correct result. 

Example 2.1 Consider a person who is only interested in four activities: walking in the park (walk), shopping (shop), 
cleaning the apartment (clean) and playing tennis (tennis). Also, consider that the choice of what the person does on 
a given day is determined exclusively by the weather on that day, which can be either rainy or sunny. Now, supposing 
that we observe what the person did on a sequence of days, the question is: can we use that information to predict the 
weather on each of those days 1 To tackle this problem, we assume that the weather behaves as a discrete Markov chain: 
the weather on a given day depends only on the weather on the previous day. The entire system can be described as an 
HMM. 

For example, assume we are asked to predict the weather conditions on two different sequences of days. During these 
two sequences, we observed the person performing the following activities: 

• "walk walk shop clean" 

• "clean walk tennis walk" 
This will be our test set. 

Moreover, and in order to train our model, we are given access to three different sequences of days, containing both 
the activities performed by the person and the weather on those days, namely: 

• "walk/rainy walk/sunny shop/sunny clean/sunny" 

• "walk/rainy walk/rainy shop/rainy clean/sunny" 

• "walk/sunny shop/sunny shop/sunny clean/sunny" 
This will be our training set. 



Figure 2.2 shows the HMM model for the first sequence of the training set, which already includes the start and 



stop symbols. The notation is summarized in Table 2.2 



Exercise 2.1 Load the simple sequence dataset. From the ipython command line create a simple sequence object and look 
at the training and test set. 
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rainy sunny sunny sunny 



o— o 



o o o o 

shop 




walk 



walk 



clean 



Figure 2.1: Diagram showing the conditional independence relations of the HMM. As an example, the variables 
are the values of the first sentence of the training set of the simple sequence. 



HMM Notation 


X 


observed sequence "walk walk shop clean" 


N = 4 


observation length 


i 


position in the sentence: z € {1 . . . N} 


S = {walk, shop, clean, tennis} 


observation set 


i 


index into the observation set j E {!,...,]} 


Xj = Wj 


observation at position i has value Wj 


A — {rainy, sunny} 


state set 


k 


index into state set A: e {1, . . . , X} 


= Ck 


state at position i has value cj- 



Table 2.2: HMM notation for the simple example. 



»> import Ixmls . readers . simple_sequence as ssr 
»> simple = ssr . SimpleSequence ( ) 
>>> pr±i. simple .train 

[walk/rainy walk/sunny shop/ sunny clean/ sunny , walk/rainy walk/rainy shop/rainy clean/ 

sunny , walk/ sunny shop /sunny shop/ sunny clean/ sunny ] 
»> print simple, test 

[walk/ rainy walk/ sunny shop/ sunny clean/ sunny , clean/sunny walk/ sunny tennis/sunny walk/ 
sunny ] 



Get in touch with the classes used to store the sequences, you will need this for the next exercise. Note that each label is 
internally stored as a number. This number can be used as index of an array to store information regarding that label. 



»> for sequence simple. train 


seq_list : 


print sequence 




walk/rainy walk/sunny shop/ sunny 


clean/ sunny 


walk/rainy walk/ rainy shop/rainy 


clean/ sunny 


walk/sunny shop/ sunny shop/ sunny 


clean/sunny 


»> for sequence simple, train 


seq_list : 


print sequence. X 




[0, 0, 1, 2] 




[0, 0, 1, 2] 




[0, 1, i, 2] 




»> for sequence in simple, train 


seq_list : 


print sequence. y 




[0, 1, i, 1] 




[0, 0, 0, 1] 




[1, 1, 1, 1] 





The probability distributions P{Yi\yi-i) are called transition probabilities; the distributions P{Yi\Yq = start) 
are the initial probabilities, and -P(yjv+i = stop| Yjv) the final probabilities^FinaWy, the distributions P(X;|Y,) are 



^Note that the initial and final probabilities are asymmetric. 
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called emission probabilities. 

A first order HMM model has the following independence assumptions over the joint distribution P{X = 

x,Y = y): 

• Independence of previous states. The probability of being in a given state at position i only depends on 
the state of the previous position f — 1 . Formally, 

PO^i = yil^i-i = Vi~\,^i-i = y\-i,---,^\ = 3/1) = P'y^i = ViVi-\ = Vi-i) 

defining a first order Markov chain|^ 

• Homogeneous transition. The probability of making a transition from state c; to state cj. is independent 
of the particular position in the sequence. That is, for all i,t E {!,..., N}, 

PiYr = Ck\Yi-i = c,) = P{Yt = c,|Yf_i = ci) 

• Observation independence. The probability of observing X, = x, at position i is fully determined by the 
state Y; at that position. Formally, 

P{Xi = Xi\Yi =y-^,...,Yj= yi, . . . , Yn = yjv) = P(X/ = x/| Y,- = y,) 

This probability is independent of the particular position so, for every i and t, we can write: 

P(X,- = zVj\Yi = Cfc) = P{Xt = zVj\Yt = 

These conditional independence assumptions are crucial to allow efficient inference, as it will be described. 

The distributions that define the HMM model are summarized in Table 12.31 For each one of them we will 
use a short notation to simplify the exposition. 



HMM distributions 


Name 


probability distribution 


short notation 


array size 


initial probability 


^^(^1 = Ck\Yo = start) 


f'mit(c)c start) 


K 


transition probability 


P{Y, = c,\Yi_, = ci) 


Arans(c;tk/) 


KxK 


final probability 


P{yN+i = stoplYjv = q.) 


f'fmal(stop|Cjt) 


K 


emission probability 


P(X, = Wj\Yi = Ck) 


-Pemiss(^/|'^/c) 


J xK 



Table 2.3: HMM probability distributions. 
The joint distribution can be expressed as: 

P(Xi = xi, . . . , Xjv = X]v, Yi = 1/1, . . . , Yn = 1/n) = 

/N-l \ N 

f'mit(l/l|start) X Yl Arans(l/i+l|l//) X -Pfoal ( St op |i/n) X ^emiss (^i |l//)/ 



(2.1) 



i=l 



(=1 



which for the example of Figure 2.1 



is: 



P(Xi = xi, . . . , X4 = X4, Yl = yi, . . . , Y4 = y4) = 

Pinit(rainy| start) X Ptrans(sunny|rainy) X Ptrans(sunny| sunny) X Ptrans(sunny| sunny) X 
-Pfinal(stop| sunny) X Pemiss(w3-lk| rainy) X Peiniss(wslk| sunny) X Penuss(shop| sunny) 
X-Pemiss(clean|sunny). (2.2) 

In the next section we turn our attention to estimating the different probability distributions of the model. 



^The order of the Markov chain depends on the number of previous positions taken into account. The remainder of the exposition can 
be easily extended to higher order HMMs, giving the model more generality, but making inference more expensive. 
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2.3 Finding the Maximum Likelihood Parameters 



One important problem in HMMs is to estimate the model parameters, i.e., the distributions depicted in Ta- 



ble 2.3 We will refer to the set of all these parameters as 9. In a supervised setting, the HMM model is trained 
to maximize the joint log-likelihood of the data. Given a dataset Ti^, the objective being optimized is: 



M 



argmax ^ logPe(X = Y 



(2.3) 



m=l 



2.1 



where Pe{X = x"^ ,Y = y"') is given by Eq. 

In some applications {e.g. speech recognition) the observation variables are continuous, hence the emission 
distributions are real-valued {e.g. mixtures of Gaussians). In our case, both the state set and the observation 
set are discrete (and finite), therefore we use multinomial distributions for the emission and transition prob- 
abilities. Multinomial distributions are attractive for several reasons: first of all, they are easy to implement; 
secondly, the maximum likelihood estimation of the parameters has a simple closed form. The parameters are 
just normalized counts of events that occur in the corpus (the same as the Najfve Bayes from previous class). 

Given our labeled corpus D^, the estimation process consists of counting how many times each event occurs 
in the corpus and normalize the counts to form proper probability distributions. Let us define the following 
quantities, called sufficient statistics, that represent the counts of each event in the corpus: 



M 



Initial Counts: Cu-dt(cjt) = liyf = c^); 

m=l 



M N 



Transition Counts: Ctrans(cfc,c;) = X] X] !(!/"' = A ^ = q); 

m=l 1=2 

M 



Final Counts: Cfinai(c)c) = ^ l(yN = ^it); 



M N 



Emission Counts: Cemiss{^j,Ck) = J2 JL'^i^T 

m=l i=l 



Wj A y- 



Ck); 



(2.4) 
(2.5) 
(2.6) 
(2.7) 



Here y™, the underscript denotes the state index position for a given sequence, and the superscript denotes the 
sequence index in the dataset, and the same applies for the observations. Note that 1 is an indicator function 
that has the value 1 when the particular event happens, and zero otherwise. In other words, the previous 
equations go through the training corpus and count how often each event occurs. For example, Eq. 2.5 counts 



how many times cj- follows state c;. Therefore, Ctrans( sunny, rainy) contains the number of times that a sunny 
day followed a rainy day. 

After computing the counts, one can perform some sanity checks to make sure the implementation is cor- 
rect. Summing over all entries of each count table we should observe the following: 

• Initial Counts - Should sum to the number of sentences: Ef=i Qrat(ct) = 

• Transition/Final Counts - Should sum to the number of tokens: I^f/^i Ctrans(Cjt,C;) -|- I^f^j Qmal('^A;) = 
MN 

• Observation Counts - Should sum to the number of tokens: Ef=i Cemissi'^j'^k) = MN. 
Using the sufficient statistics (counts) the parameter estimates are: 

Cinit(C/c) 



finit(cfc|start) 

fimaKstopIc;) 
Ptmns{Ck\ci) 
Permssi^^jl'^k) 



E/=l Cinit(c/) 



Qmal(c;) 



Ejc=l Ctrans (Cfc,C;) -|- C£inal(c;) 



Ctrans(C)c/ C/) 



Ep=l Qrans (Cp,C;) + Cfinai(c;) 



Efl=i Cemiss(^q/ '^/r) 



(2.8) 
(2.9) 
(2.10) 

(2.11) 
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Exercise 2.2 The provided function train_supervised/roOT the hmm.py file implements the above parameter estimates. 
Run this function given the simple dataset above and look at the estimated probabilities. Are they correct? You can also 
check the variables ending in _counts instead of _probs to see the raw counts (for example, typing hmm.initiaLcounts 
will show you the raw counts of initial states). How are the counts related to the probabilities? 



>>> import Ixmls . sequences . hmm hmmc 

»> hmm = hmmc . HMM (simple . x_dict, simple . y_dict ) 

>>> hmim . train_supervised (simple . train) 

»> print "Initial Probabilities :" , hmm. initial_probs 
[ 0.66666667 0.33333333] 

»> print "Transition Probabilities:", hmm. trans it ion_probs 
[[ 0.5 0. ] 

[0.5 0.625] ] 
»> print "Final Probabilities:", hmm. final_probs 
[ 0. 0.375] 

>>> print "Emission Probabilities", hmm. emission_probs 
[[0.75 0.25 ] 

[0.25 0.375] 

[ 0. 0.375] 

[ 0. 0. ]] 



2.4 Decoding a Sequence 

Given the learned parameters and a new observation sequence x = xi . . . x^, we want to find the sequence of 
hidden states y* =y\ . . . that "best" explains it. This is called the decoding problem. There are several ways 
to define what we mean by the "best" y* , depending on our goal: for instance, we may want to minimize the 
probability of error on each hidden variable Y,, or we may want to find the best assignment to the sequence 
Yi . . . Y]v as a whole. Therefore, finding the best sequence can be accomplished through different approaches: 

• A first approach, normally called posterior decoding or minimum risk decoding, consists in picking the 
highest state posterior for each position i in the sequence: 

y* = argmaxP(Y; = i/,|Xi = xi,...,Xn = x^). (2.12) 

Note, however, that this approach does not guarantee that the sequence y* = . . . will be a valid 
sequence of the model. For instance, there might be a transition between two of the best state posteriors 
with probability zero. 

• A second approach, called Viterbi decoding, consists in picking the best global hidden state sequence: 

y* = argmaxP(Yi =yi,...,YN =yN|Xi =xi,...,X]v = xn) 

y=yi-yN 

= argmaxP(Yi = yi,. . .,Yn = yw/Xi = xi,. ..,Xjv = xn). (2.13) 

y=yi-yN 



Both approaches (which will be explained in Sections 2.4.2 and 2.4.3 respectively) rely on d5mamic pro- 
gramming and make use of the independence assumptions of the HMM model. Moreover, they use an alter- 
native representation of the HMM called a trellis. 

A trellis unfolds all possible states for each position and it makes explicit the independence assumption: 
each position only depends on the previous position. Here, each column represents a position in the sequence 



and each row represents a possible state. Figure 2.2 shows the trellis for the particular example in Figure 2.1 



Considering the trellis representation, note that we can include the following information: 

• an initial probability to the arrows that depart from the start symbol; 

• a final probability to the arrows that reach the stop symbol; 

• a transition probability to the remaining arrows; and. 
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rainy rainy rainy rainy 




sunny sunny sunny sunny 



Figure 2.2: Trellis representation of the HMM in Figure 2.1 for the observation sequence "walk walk shop 



clean", where each hidden variable can take the values rainy or sunny. 

• an emission probability to each circle, which is the probability that the observed symbol is emitted by that 
particular state. 

For convenience, we will be working with log-probabilities, rather than probabilities |^ Therefore, if we 
associate to each circle and arrow in the trellis a score that corresponds to the log-probabilities above, and if we 
define the score of a path connecting the start and stop symbols as the sum of the scores of the circles and 
arrows it traverses, then the goal of finding the most likely sequence of states (Viterbi decoding) corresponds 
to finding the path with the highest score. 

The trellis scores are given by the following expressions: 

• For each state cj^: 

scoreinit(c)c) = logPinit(yi = C)c|start). (2.14) 

• For each position z G 1, . . . , Af — 1 and each pair of states c^- and c;: 

SCOretrans(!,C/c/C;) = log Ptrans C>^,+1 = C,t l^^^i = Q)- (2-15) 

• For each state C/ (verify equation: remove i from score?): 

SCOrefinal(!,C/) = logPfinal(stop|Yjv = q). (2.16) 

• For each position i E 1, . . . , N and state Cj^: 

SCOreemiss(2/Cfc) = log Pemiss (X^, = X,- 1 Y, = C;t)- (2-17) 

In the next exercise, you will compute the trellis scores. 
Exercise 2.3 Convince yourself that the score of a path in the trellis (summing over the scores above) is equivalent to the 



log-probability logP(X = x,Y = y), as defined in Eq. 2.2 Use the given function compute_scores on the first training 



sequence and confirm that the values are correct. You should get the same values as presented below. 



>>> initial_scores, transition_scores, final_scores, emission_scores = hmm. compute_scores 

(simple . train . seq_list [ 0 ] ) 
»> print initial_scores 
[-0.40546511 -1.09861229] 
»> print transition_scores 
[[ [-0.69314718 -inf] 
[-0.69314718 -0.47000363] ] 

[ [-0.69314718 -inf] 
[-0.69314718 -0.47000363] ] 

[ [-0.69314718 -inf] 
[-0.69314718 -0.47000363] ] ] 
»> print final_scores 
[ -inf -0. 98082925] 



*This will be motivated further in Section 2.4.1 where we describe how operations can be performed efficiently in the log-domain. 
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»> print emission_scores 
[ [-0.28768207 -1 .38629436] 
[-0 . 28768207 -1 . 38629436] 
[-1 .38629436 -0.98082925] 
[ -inf -0.98082925] ] 



Note that scores which are -co (-inf) correspond to zero-probability events. 
2.4.1 Computing in log-domain 

We will see that the decoding algorithms will need to multiply twice as many probability terms as the length 
N of the sequence. This may cause underflowing problems when N is large, since the nested multiplication of 
numbers smaller than 1 may easily become smaller than the machine precision. To avoid that problem, jRabiner 
( [1989 ) presents a scaled version of the decoding algorithms that avoids this problem. An alternative, which 
is widely used, is computing in the log-domain. That is, instead of manipulating probabilities, manipulate 
log-probabilities (the scores presented above). Every time we need to multiply probabilities, we can sum their 
log-representations, since: 

log(exp(fl) X exp{b)) = a + b. (2.18) 
Sometimes, we need to add probabilities. In the log domain, this requires us to compute 

log(exp(fl) + exp(&)) =fl-hlog(l + exp(f7-fl)), (2.19) 

where we assvime that a is smaller than b. 

Exercise 2.4 Look at the module sequence s/log_domain . py. This module implements a function logsum_pair (logx, 
logy) to add two numbers represented in the log-domain; it returns their sum also represented in the log-domain. The 
function logsum ( logv) sums all components of an array represented in the log-domain. This will be used later in our 
decoding algorithms. To observe why this is important, type the following: 

»> import numpy np 

»> a = np . random, rand (10) 

»> np . log (sum (np . exp (a ) ) ) 

2.83971 72643228661 

>>> np . log (sum (np . exp ( 1 0 *a) ) ) 

10 . 12109991 7705818 

>>> np . log (sum (np . exp (100 *a) ) ) 

93 . 159220940569128 

»> np . log (sum (np . exp ( 1000 *a )) ) 

inf 

»> from Ixmls . sequences . log_domain import * 

»> logsum (a) 

2.83971 72643228665 

>>> logsum (10*a) 

10 . 12109991 7705818 

>>> logsum (100*a) 

93 . 159220940569114 

>>> logsum (1000*a) 

925. 88496219586864 



2.4.2 Posterior Decoding 

Posterior decoding consists in picking state with the highest posterior for each position in the sequence inde- 
pendently; for each i = 1, . . . ,N: 

y* = argmaxP(Y; = y,|X = x). (2.20) 

The sequence posterior distribution is the probability of a particular hidden state sequence given that we have 
observed a particular sequence. Moreover, we will be interested in two other posteriors distributions: the state 
posterior distribution, corresponding to the probability of being in a given state in a certain position given the 
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observed sequence; and the transition posterior distribution, which is the probability of making a particular 
transition, from position i to i + 1, given the observed sequence. They are formally defined as follows: 

P(X = X Y = v) 

Sequence Posterior: P{Y = y\X = x) = p(^x - x) ^^'^'^'^ 

State Posterior: P(y,- = i/,|X = x); (2.22) 

Transition Posterior : P(Y,_|.i = i//+i,y,' = i/, |X = x). (2.23) 

To compute the posteriors, a first step is to be able to compute the likelihood of the sequence P{X = x), 
which corresponds to summing the probability of all possible hidden state sequences. 

Likelihood: P(X = x) = P{X = x,Y = y). (2.24) 

The number of possible hidden state sequences is exponential in the length of the sequence (lAj'^), which 
makes the sum over all of them hard. In our simple example, there are 2* = 16 paths, which we can actually 
explicitly enumerate and calculate their probability using Equation |2.1| But this is as far as it goes: for example, 
for Part-of-Speech tagging with a small tagset of 12 tags and a medium size sentence of length 10, there are 
12^*^ = 61917364224 such paths. Yet, we must be able to compute this sum (sum over y G A'^) to compute 
the above likelihood formula; this is called the inference problem. For sequence models, there is a well known 
dynamic programming algorithm, the Forward-Backward (FB) algorithm, which allows the computation to 
be performed in linear timej^by making use of the independence assumptions. 

The FB algorithm relies on the independence of previous states assumption, which is illustrated in the trellis 
view by having arrows only between consecutive states. The FB algorithm defines two auxiliary probabilities, 
the forward probability and the backward probability. 

Forward Probability : forward(f,c/J = P(Y; = cj.,Xi = xj,. ..,X,- = x,) (2.25) 

The forward probability represents the probability that in position i we are in state Y, = cj. and that we have 
observed xi, . . . , x; up to that position. The forward probability is defined by the following recurrence rule: 

forward(l,Cfc) = Pinit(C;t| start) x Pemiss(^l kfc) (2.26) 

forward(f,Ct) = Ptrans(Cfc|c;) X forward(f - 1, C;) X Pemiss(^/kjc) (2.27) 

forward(N + 1, stop) = ^ Pfinai(stop|c;) x forward(N,c;). (2.28) 

Using the forward trellis one can compute the likelihood simply as: 

P(X = x) = forward(]V + 1, stop). (2.29) 

Although the forward probability is enough to calculate the likelihood of a given sequence, we will also 
need the backward probability to calculate the state posteriors. The backward probability is similar to the 
forward probability, but operates in the inverse direction. It represents the probability of observing X/+i, . . . , Xjv/ 
from position f + 1 up to N, given that at position i we are at state Y, = Cf. 

Backward Probability: backward(;, c;) = P{Xj_^_■^ = X|_^_■^, . . . ,Xn = xj^\Yj = c;). (2.30) 

The backward recurrence is given by: 

backward (N,c;) = -Pfir,ai(stop|c;) (2.31) 
backward(f,c;) = ^ Ptrans(c/t|c/) x backward(f + 1, c,t) x ^'emiss(^;+i (2.32) 

backward(0. Start) = ^ Pinit(Cyt|start) x backward(l,Cjt) x Pgnussl^lkjc)- (2.33) 



^The runtime is linear with respect to the sequence length. More precisely, the runtime is 0(N| A|^). A naive enumeration would cost 
0(|A|~). 
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Algorithm 7 Forward-Backward algorithm 



input: sequence Xi, . . . , X^, scores Pmit, -Ptrans, f'fmal/ f'emiss 

2: Forward pass: Compute the forward probabilities 
3: Initialization 
4: for e A do 

5: forward(l,Cfc) = Pinit(Cfc| start) X Pemiss(^l |c/c) 
6: end for 

for z = 2 to N do 
for c;. e A do 

9: forward(f,Cfc) = Arans(Cjt|c;) X forward(f - 1,C/) X Pemiss(^ik)c) 

Vc,6A / 

10: end for 
11: end for 

12: Backward pass: Compute the backward probabilities 
13: Initialization 
14: for C; e A do 

15: backward (N,c;) = -Pfmal(stop|c/) 
16; end for 

17: for z = N - 1 to 1 do 

18: backward (/,C;) = J2 Arans(Cit|c;) X backward(f + l,C,t) X Pemiss(^i+lkfc) 

19: end for 

20: output: The forward and backward probabilities. 

Using the backward trellis one can compute the likelihood simply as:. 

P(X = x) = backward(0, start). (2.34) 

Moreover, with the forward and backward probabilities, we can compute the likelihood of a given sequence 
using any position i in the sequence as follows: 

P{X = x) = ^ P(Xi =xi,...,XN = XN,Y; = Cfc) 

= y P(Xi =xi,...,Xi= Xi, Yi = c^) X P(X;+i = Xi+i, . . . , Xn = xjvlY/ = c^) 
c,eA^ ^ ' ^ ^ ' 

forward(i,Ci-) backward(/,cj-) 

= forward ( f, cj-) X backward c/f). (2.35) 

This equation will work for any choice of i. Although redundant, this fact is useful when implementing 
an HMM as a sanity check that the computations are being performed correctly, since one can compute this 
expression for several i; they should all yield the same value. 

Algorithm |7]shows the pseudo code for the forward backward algorithm. 

Exercise 2.5 Run the provided forward-backward algorithm on the first train sequence. Observe that both the forward 
and the backward passes give the same log-likelihood. 



>>> log_likelihood, forward = hmm. decoder . run_forward ( initial_scores, transition_scores, 

final_scores , emission_scores ) 
»> print 'Log-Likelihood =' , log_likelihood 
Log-Likelihood = -5 . 0 6823232 601 

»> log_likelihood, backward = hmm. decoder . run_backward (initial_scores, transition_scores 

, final_scores , emission_scores) 
»> print 'Log-Likelihood =', log_likelihood 
Log-Likelihood = -5 . 06823232601 



Given the forward and backward probabilities, one can compute both the state and transition posteriors 
(you can hint why by looking at the term inside the sum in Eq. 2.35[ . 
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rainy rainy rainy rainy rainy 




-Ptrans { Sunny | Sunny ) P^mi^^ (shop | sunny) 



Figure 2.3: A graphical representation of the components in the state and transition posteriors. Recall that the 
observation sequence is "walk walk shop clean". 



„ . r,/,, I,, N forwardff, 1//) X backward(f, 1/;) 
State Posterior: P(Y,- = i/,|X = x) = P{X - x) ^ ^ ^ 



Transition Posterior: P(Y; = y,, Yj+i = i//+i|X = x) = 

forward (f,i/,) x Ptrans (3/1+1 Ij//) X -Pemiss(^i+i|yi+i) X backward(f + 

P{X = x) 



(2.37) 



A graphical representation of these posteriors is illustrated in Figure |2.3| On the left it is shown that 
forward x backward (f,y,) returns the sum of all paths that contain the state y„ weighted by P{X = x); 
on the right we can see that forward(f,y,) x PtransCyi+ilyO x -Pemiss(^/+i|y!+i) x backward(; + l,y,+i) returns 
the same for all paths containing the edge from y/ to y/+i. 

As a practical example, given that the person performs the sequence of actions "walk walk shop clean", 
we want to know the probability of having been raining in the second day. The state posterior probability for 
this event can be seen as the probability that the sequence of actions above was generated by a sequence of 
weathers and where it was raining in the second day. In this case, the possible sequences would be all the 
sequences which have rainy in the second position. 

Using the state posteriors, we are ready to perform posterior decoding. The strategy is to compute the state 
posteriors for each position i E {!,..., N} and each state cj- e A, and then pick the arg-max at each position: 

y,- := argmaxP(Y,- = y,|X = x). (2.38) 

Exercise 2.6 Compute the node posteriors for the first training sequence (use the provided compute_posteriors/wnc- 
tion), and look at the output. Note that the state posteriors are a proper probability distribution (the lines of the result 
sum to 1). 

>>> initial_scores, transition_scores, final_scores, emission_scores = hmm. compute_scores 
(simple . train . seq_list [ 0 ] ) 

>>> state_posteriors, _, _ = hmm. compute_posteriors (initial_scores, 

transition_scores , 
final_scores , 
emi ssion_s cores ) 

»> print state_posteriors 
[[ 0.95738152 0.04261848] 

[ 0.75281282 0.24718718] 

[ 0.26184794 0.73815206] 

[ 0. 1. ]] 



Exercise 2.7 Run the posterior decode on the first test sequence, and evaluate it. 

>>> y_pred = hmm. posterior_decode ( simple . test . seq_list [ 0 ] ) 
»> print "Prediction test 0;", y_pred 
walk/rainy walk/rainy shop/sunny clean/sunny 
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»> print "Truth test 0:", simple . test . seq_list [ 0 ] 
walk/rainy walk/sunny shop/sunny clean/sunny 



Do the same for the second test sequence: 

>>> y_pred = hmm . posterior_decode ( simple . test . seq_list [ 1 ] ) 
»> print "Prediction test 1 : ", y_pred 
clean/rainy walk/rainy tennis/rainy walk/rainy 
»> print "Truth test 1 : " , simple . test . seq_list [ 1 ] 
clean/sunny walk/sunny tennis/sunny walk/sunny 



What is wrong? Note the observations for the second test sequence: the observation tennis was never seen at 
training time, so the probability for it will be zero (no matter what state). This will make all possible state sequences have 
zero probability. As seen in the previous lecture, this is a problem with generative models, which can be corrected using 
smoothing (among other options). 

Change the train_supervised method to add smoothing: 

det train_sup6rvis6d (:-:eif , sequence_list , smoothing) : 



Try, for example, adding 0.1 to all the counts, and repeating this exercise with that smoothing. What do you observe? 

>>> hmm. train_supervised (simple . train, smoothing=0 . 1) 

»> y_pred = hmm. posterior_decode ( simple . test . seq_list [ 0 ] ) 

»> print "Prediction test 0 with smoothing:", y_pred 

walk/rainy walk/rainy shop/sunny clean/sunny 

>>> print "Truth test 0;", simple . test . seq^list [ 0 ] 

walk/rainy walk/sunny shop/sunny clean/sunny 

>>> y_pred = hmm. posterior_decode ( simple . test . seq_list [ 1 ] ) 

»> print "Prediction test 1 with smoothing:", y_pred 

clean/sunny walk/sunny tennis/sunny walk/sunny 

»> print "Truth test 1:", simple . test . seq_list [ 1 ] 

clean/sunny walk/sunny tennis/sunny walk/sunny 



2.4.3 Viterbi Decoding 

Viterbi decoding consists in picking the best global hidden state sequence y as follows: 

y = argmaxP(Y = y\X = x) = argmaxP(X = x,Y = y). (2.39) 

The Viterbi algorithm is very similar to the forward procedure of the FB algorithm, making use of the same 
trellis structure to efficiently represent the exponential number of sequences without prohibitive computation 
costs. In fact, the only difference from the forward-backward algorithm is in the recursion |Z27 where instead 
of summing over all possible hidden states, we take their maximum. 

Viterbi viterbi(f, w,) = max PiY^ = yi,. . . Yj = Vj, Xi — x\,. . . , X,- = x,) (2.40) 

The Viterbi trellis represents the path with maximum probability in position i when we are in state Y; = y,- 
and that we have observed x\, . . . ,x\ up to that position. The Viterbi algorithm is defined by the following 
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Algorithm 8 Viterbi algorithm 



1: input: sequence Xi,..., X^, scores Pmit, -Ptrans, f'fmal/ f'emiss 

2: Forward pass: Compute the best paths for every end state 

3: Initialization 

4: for Cj: e A do 

5: viterbi(l,C/J = Pinit(c,tl start) x Pemiss(^l|cjt) 

6: end for 

7: for z = 2 to N do 

8: for c;. e A do 

9: viterbi(f,Cfc) = ( maxPtrans(c/c|c;) x viterbi(f - l,c;) ) x Pemiss(^/kjc) 

10: backtrack (f,Ci-) = argmaxPtrans(cj:|c;) x viterbi(f — l,c;) 

V c,eA J 

11: end for 

12: end for 

13: maXy^^N Fix = x,Y = y) •— max^^gA Pfmai(stop|c;) x viterbi(N, C/) 

14: 

15: Backward pass: backtrack to obtain the most likely path 

16: = argmax^^g^Pfinai(stop|c;) X viterbi(N,C/) 

17: f or / = N - 1 to 1 do 

18: yi = backtrack(/ + l,y,+i) 

19: end for 

20: output: the viterbi path y. 



recurrence: 

viterbi(l,C;t) = -Pinit(c)c| start) x Pemiss(^lkjt) (2.41) 
viterbi (f,c/J = ( maxPtrans(cfc|c;) x viterbi(f - l,c;) ) x Pemiss(^ik)c) (2.42) 

backtrack (f,c J.) = argmaxPtrans(c)ck/) x viterbi(f — l,c;) (2.43) 
viterbi(]V + 1, stop) = maxPfinai(stop|c;) x viterbi(N,C/) (2.44) 

C;eA 

backtrack(N + 1, stop) = argmaxPfinai(stop|c;) x viterbi(N,c/). (2.45) 

c,eA 

Algorithm |8] shows the pseudo code for the Viterbi algorithm. Note the similarity with the forward algo- 
rithm. The only differences are: 

• Maximizing instead of summing; 

• Keeping the argmax's to backtrack. 



2.5 Assignment 

With the previous theoretical background, you have the necessary tools to solve today's assignment. 

Exercise 2.8 Implement a method for performing Viterbi decoding in file sequence.classif ication_decoder .py. 



def rvin_viterbi (self , initial_scores, transition_scores, final_scores , 
emission_scores) : 



Hint: look at the implementation o/run_f orward. Also check the help for the numpy methods max and argmax. 
This method will be called by 
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aef viterbi_dBCOcie(self, sequence) 



in the module sequence_classif ier . py. 

Test your method on both test sequences and compare the results with the ones given. 

»> hmm. train_supervised (simple . train, smoothing=0 . 1) 

>>> y_pred, score = hmm. viterbi_decode (simple . test . seq_list [ 0 ] ) 

>>> print "Viterbi decoding Prediction test 0 with smoothing:", y_pred, score 

walk/rainy walk/rainy shop/sunny clean/sunny -6.02050124 698 

>>> print "Truth test 0 : " , simple . test . seq^list [ 0 ] 

walk/rainy walk/sunny shop/sunny clean/sunny 

>>> y_pred, score = hmm. viterbi_decode (simple . test . seq_list [ 1 ] ) 
»> print "Viterbi decoding Prediction test 1 with 
smoothing : ", y_pred, score 

clean/sunny walk/sunny tennis/sunny walk/sunny -11.713974074 
»> print "Truth test I;", simple . test . seq^list [ 1 ] 
clean/sunny walk/sunny tennis/sunny walk/sunny 



Note: since we didn't run the train_supervised method again, we are still using the result of the training using 
smoothing. Therefore, you should compare these results to the ones of posterior decoding with smoothing. 

2,6 Part-of-Speech Tagging (POS) 

Part-of-Speech (PoS) tagging is one of the most important NLP tasks. The task is to assign each word a gram- 
matical category, or Part-of-Speech, i.e. noun, verb, adjective,... Recalling the defined notation, E is a vocabu- 
lary of word types, and A is the set of Part-of-Speech tags. 

In English, using the Perm Treebank (PTB) corpus ( [Marcus et al. 
of speech tagging is around 97% for a variety of methods. 

In the rest of this class we will use a subset of the PTB corpus, but instead of using the original 45 tags we 
will use a reduced tag set of 12 tags, to make the algorithms faster for the class. In this task, x is a sentence {i.e., 
a sequence of word tokens) and y is the sequence of possible PoS tags. 

The first step is to load the corpus. We will start by loading 1000 sentences for training and 1000 sentences 
both for development and testing. Then we train the HMM model by maximum likelihood estimation. 



>>> 


import Ixmls . reader s . pos_corpus pec 




>>> 


corpus = pec . PostagCorpus ( ) 




>>> 


train_seq = corpus . read_sequence_list_conll ( "data/train-02-2 1 . 


conll" , max_sent_len=15, 




max_nr_sent=1000) 




>>> 


test_seq = corpus . read_sequence_list_conll ( "data/test-23 . conll 


" , max_sent_len=15, 




max_nr_sent=1000) 




>>> 


dev_seq = corpus . read_sequenoe_list_conll ( "data/dev-22 .conll " , 


max_sent_len=15 , 




inax_nr_sent = l 000 ) 




>>> 


hmm = hmimc . HMM (corpus . word_dict , corpus . tag_dict ) 




>>> 


hmm. train_supervised (train_seq) 




>>> 


hmm . print_transition_matrix ( ) 





Look at the transition probabilities of the trained model (see Figure [Z4| , and see if they match your intuition 
about the English language (e.g. adjectives tend to come before nouns). 

Exercise 2.9 Test the model using both posterior decoding and Viterbi decoding on both the train and test set, using the 
methods in class HMM: 



>>> viterbi_pred_train = hmm. viterbi_decode_corpus (train_seq) 

»> poster ior_pred_train = hmm. poster ior_decode_corpus (train_seq) 

»> eval_viterbi_train = hmm. evaluate_corpus (train_seq, viterbi_pred_train) 

>>> eval posterior_train = hmm. evaluate_corpus (train_seq, posterior_pred_train) 

»> print "Train Set Accuracy: Posterior Decode \%.3f, Viterbi Decode: \%.3f"\%( 
eval_posterior_train, eval_viterbi_train ) 



I, the current state of the art for part 
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Figure 2.4: Transition probabilities of the trained model. Each column is the previous state and row is the 
current state. Note the high probability of having Noun after Determinant or Adjective, or of having Verb after 
Nouns or Pronouns, as expected. 



Train Set Accuracy: Posterior Decode 0.985, Viterhi Decode: 0.985 

>>> viterbi_pred_test = hmm.viterbi_decode_corpus (test_seq) 

>>> poster ior_pred_test = hmm.posterior_decode_corpus (test_seq) 

»> eval_viterbi_test = hmm.evaluate_corpus (test_seq, viterbi_pred_test) 

>>> eval_posterior_test = hmm. evaluate_corpus (test_seq, poster ior_pred_test) 

»> print "Test Set Accuracy: Posterior Decode \%.3f, Viterbi Decode: \%.3f"\%( 

eval_posterior_test , eval_viterbi_test ) 
Test Set Accuracy: Posterior Decode 0.350, Viterbi Decode: 0.509 



What do you observe? Remake the previous exercise but now train the HMM using smoothing. Try different values 
(0,0.1,0.01,1) and report the results on the train and development set. (Use function pick_bestjmoothingj. 

»> best_smoothing = hmm.pick_best_smoothing (train_seq, dev_seq, [10,1,0.1,0]) 

Smoothing 10.000000 — Train Set Accuracy: Posterior Decode 0.731, Viterbi Decode: 0.691 

Smoothing 10.000000 — Test Set Accuracy: Posterior Decode 0.712, Viterbi Decode: 0.675 

Smoothing 1.000000 — Train Set Accuracy: Posterior Decode 0.887, Viterbi Decode: 0.865 

Smoothing 1.000000 — Test Set Accuracy: Posterior Decode 0.818, Viterbi Decode: 0.792 

Smoothing 0.100000 — Train Set Accuracy: Posterior Decode 0.968, Viterbi Decode: 0.965 

Smoothing 0.100000 — Test Set Accuracy: Posterior Decode 0.851, Viterbi Decode: 0.842 

Smoothing 0.000000 — Train Set Accuracy: Posterior Decode 0.985, Viterbi Decode: 0.985 

Smoothing 0.000000 — Test Set Accuracy: Posterior Decode 0.370, Viterbi Decode: 0.526 

»> hmm. train_supervised (train_seq, smoothing=best_smoothing) 

»> viterbi_pred_test = hmm. viterbi_decode_corpus (test_seq) 

>>> poster ior_pred_test = hmm.posterior_decode_corpus (test_seq) 

»> eval_viterbi_test = hmm. evaluate_corpus (test_seq, viterbi_pred_test ) 

>>> eval_posterior_test = hmm . evaluate_corpus (test_seq, poster ior_pred_test ) 

»> print "Best Smoothing \%f Test Set Accuracy: Posterior Decode \%.3f, Viterbi 

Decode : \% . 3f"% (best_smoothing, eval_posterior_test, eval_viterbi_test ) 
Best Smoothing 0.100000 — Test Set Accuracy: Posterior Decode 0.837, Viterbi Decode: 0. 

827 



Perform some error analysis to understand were the errors are coming from. You can start by visualizing the confusion 



matrix (true tags vs predicted tags). You should get something like what is shown in Figure 2.5 



>>> import Ixmls . sequences . con fusion_matrix h- cm 
»> import matplotlib . pyplot as pit 

»> confusion_matrix = cm.build_confusion_matrix (test_seq. seq_list, viterbi_pred_test, 
len ( corpus . tag_dict ) , hmm. get_num_states ( ) ) 
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Figure 2.5: Confusion Matrix for the previous example. Predicted tags are columns and the true tags corre- 
sponds to the constituents of each column. 



2.7 Unsupervised Learning of HMMs 



We next address the problem of unsupervised learning. In this setting, we are not given any labeled data; all we 
get to see is a set of natural language sentences. The underlying question is: 

Can we learn something from raw text? 

This task is challenging, since the process by which linguistic structures are generated is not always clear; and 
even when it is, it is typically too complex to be formally expressed. Nevertheless, unsupervised learning 
has been applied to a wide range of NLP tasks, such as: Part-of -Speech Induc tion ([Schutze 1995{ Merialdo 



1994 Clark 2003 1, Dependency Grammar Induction dKlein and Manning 2004 Smith and Eisner |2006[l. C on 



stituency Grammar Induction ( Klein and Manning , 2004 1, Statistical Word Alignments ( Brown et al. 1993) and 
Anaphora Resolution ( Charniak and Elsnerj|2009 l, just to name a few. 

Different motivations have pushed research in this area. From both a linguistic and cognitive point of view, 
unsupervised learning is useful as a tool to study language acquisition. From a machine learning point of view, 
unsupervised learning is a fertile ground for testing new learning methods, where significant improvements 
can yet be made. From a more pragmatic perspective, unsupervised learning is required since annotated 
corpora is a scarce resource for different reasons. Independently of the reason, unsupervised learning is an 
increasingly active field of research. 

A first problem with unsupervised learning, since we don't observe any labeled data (i.e., the training set is 
now T>i[ = {x^, . . . , x^}), is that most of the methods studied so far (e.g., Perceptron, MIRA, SVMs) cannot be 
used since we cannot compare the true output with the predicted output. Note also that a direct minimization 
of the complete negative log-likelihood of the data, logPg(X^ = x^,. . .,X^ = x"'), is very challenging, since it 
would require marginalizing out {i.e., summing over) all possible hidden variables: 



\ogPe{X^ =x\...,X 



M 



M 



m=l 1/6A 



(2.46) 



Note also that the objective above is non-convex even for a linear model: hence, it may have local minima, 
which makes optimization much more difficult. 

The most common optimization method in the presence of hidden (latent) variables is the Expectation 
Maximization (EM) algorithm, described in the next section. Note that this algorithm is a generic optimization 



routine that does not depend on a particular model. Later, in Section 2.7.2 we will apply the EM algorithm to 
the task of part-of -speech induction, where one is given raw text and a number of clusters and the task is to 
cluster words that behave similarly in a grammatical sense. 



55 



2.7.1 Expectation Maximization Algorithm 

Given the training corpus Djj := {x^, . . ., x^}, training seeks model parameters 9 that minimize the negative 
log-likelihood of the corpus: 



M 



Negative Log Likelihood: L{0) = E[- logP0(X)] 



M 



^log ^ P,(X = x™,Y = y" 



(2.47) 



m=l 



\y"'eA 



where ^[/(X)] 



'^n!=l fi^'") denotes the empirical average of a function / over the training corpus. Be- 



M 



cause of the hidden variables y^, . . . , y^, the likelihood term contains a sum over all possible hidden structures 
inside of a logarithm, which makes this quantity hard to compute. 

The most common minimization algorithm to fit the model parameters in the presence of hidden variables 
is the Expectation-Maximization (EM) algorithm. The EM procedure can be thought of intuitively in the fol- 
lowing way. If we observe the hidden variables' values for all sentences in the corpus, then we could easily 
compute the maximum likelihood value of the parameters as described in Section 2.3 On the other hand. 



if we had the model parameters we could label data using the model, and collect the sufficient statistics de- 



scribed in Section 2.3 However, since we are working in an unsupervised setting, we never get to observe 
the hidden state sequence. Instead, given the training set CDjj = {x^ . . . x^}, we will need to collect expected 
sufficient statistics (in this case, expected counts) that represent the number of times that each hidden variable is 
expected to be used with the current parameters setting. These expected sufficient statistics will then be used 
during learning as "fake" observations of the hidden variables. Using the node and edge posterior distribu- 



tions described in Equations 2.22 and 2.23 the expected sufficient statistics can be computed by the following 
formulas: 



M 



Initial Counts: Qnii{ck) = E PoiK = | X™ = x'«); 



m=l 



M N 



Transition Counts: Ctrans(cfc,c,) = E E Poir"' = Ck A Y/'li = c, \ X'" = x™); 

m=l 1=2 

M 

Final Counts: Cfi„al(cfc) = E = I X"' = x"'); 



M N 



Emission Counts: Cemiss (w^;, = E E ^i^T = ^j)Peiyr = Ck \ X"' = x™); 

m=l !=1 



(2.48) 
(2.49) 
(2.50) 
(2.51) 



Compare the previous equations with the ones described in Section 2.3 for the same quantities (Eqs. 2.4 
2.7 1. The main difference is that while in the presence of supervised data you sum the observed events, when 
you have no label data you sum the posterior probabilities of each event. If these probabilities were such that 
the probability mass was around single events then both equations will produce the same result. 

The EM procedure starts with an initial guess for the parameters 9^ at time t = 0. The algorithm keeps 
iterating until it converges to a local minima of the negative log likelihood. Each iteration is divided into two 
steps: 

• The E-Step (Expectation) computes the posteriors for the hidden variables Pgf(Y|X = x™) given the 
current parameter values 6^ and the observed variables X = x'" for the m-th sentence. For an HMM, this 
can be done through the Forward-Backward algorithm described in the previous sections. 

• The M-step (Maximization) uses those posteriors Pgt(Y|X = x'") to "softly fill in" the values of the 



hidden variables Y"', and collects the sufficient statistics: initial counts (Eq: 2.48 1, transition counts (Eq: 
2.49| , final counts (Eq: 2.50[ , and emission counts (Eq: 2.51| . These counts are then used to estimate 



maximum likelihood parameters 9^^^, as described in Section 



2.3 



The EM algorithm is guaranteed to converge to a local minimum of the negative log-likelihood L{9), under 
mild conditions. Note that we are not committing to the best assignment of the hidden variables, but summing 
the occurrences of each parameter weighed by the posterior probability of all possible assignments. This 
modular split into two intuitive and straightforward steps accounts for the vast popularity of EM|^ 



*More formally, EM minimizes an upper bound of £ (6) via block-coordinate descent. See Neal and Hinton 1 1998 1 for details. Also, even 
though we are presenting EM in the context of HMMs, this algorithm can be more broadly applied to any probabilistic model with latent 
variables. 
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Algorithm Mpresents the pseudo code for the EM algorithm. 



Algorithm 9 EM algorithm. 



input: dataset Vu, initial model 6 
for t = 1 to T do 

E-Step: 

Clear counts: Cinit(.) = Ctrans(-,-) = Cfinal(-) = Cemiss(v) = 0 

for x'" e Du do 

Compute posterior expectations Pg (Y, |X = x™) and Pg (Y„ Yi+i |X = x™) using the current model 6 
Update counts as shown in Eqs. [2.48] - 2.51 



end for 
M-Step: 

Update the model parameters 9 based on the counts, 
end for 



One important thing to note in Algorithm|9]is that for the HMM model we already have all the model pieces 
we require. In fact the only method we don't have yet implemented from previous classes is the method to 
update the counts given the posteriors. 

Exercise 2.10 Implement the method to update the counts given the state and transition posteriors, 
def update_counts (self , sequence, state_posteriors, transition_posteriors) : 



You now have all the pieces to implement the EM algorithm. Look at the code for EM algorithm in file sequences /hmm.py 
and check it for yourself. 



def traln_EM ( sel r , dataset, smoothing=0 , num_epochs=10, evaluate=True) : 
self . initial ize_random ( ) 



If evaluate: 

acc = sel f . evaluate_EM (dataset ) 
print "Initial accuracy: %f"% (acc) 



for t In xrange(l, num_epochs) : 
#E-Step 

total_log_likelihood =0.0 

self . clear_counts (smoothing) 

for sequence in dataset . seq_list : 

# Compute scores given the observation sequence. 

initial_scores, transition_scores, final_scores, emission_scores = \ 
self . compute_scores (sequence) 

state_posteriors, transition posteriors , log_likelihood = \ 

self. compute_posteriors (initial_scores , 

transition_scores , 
final_scores , 
emission_scores ) 



self. update_counts (sequence, state_posteriors , transition_posteriors) 

total_log_likelihood += log_likelihood 
print "Iter: %i Log Likelihood: %f"%(t, total_log_likelihood) 
#M-Step 

self . compute_parameters () 
if evaluate: 

### Evaluate accuracy at this iteration 

acc = sel f . evaluate_EM (dataset ) 

print "Iter: %i Accuracy: %f"%(t,acc) 
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2.7.2 Part of Speech Induction 



In this section we present the Part-of-Speech induction task. Part-of-Speech tags are pre-requisite for many 
text applications. The task of Part-of-Speech tagging where one is given a labeled training set of words and 
respective tags is a well studied task with several methods achieving high prediction quality, as we saw in the 
first part of this chapter. 

On the other hand the task of Part-of-Speech induction where one does not have access to a labeled corpus 
is a much harder task with a huge space for improvement. In this case, we are given only the raw text along 
with sentence boundaries and a predefined number of clusters we can use. This problem can be seen as a 
clustering problem. We want to cluster words that behave grammatically in the same way on the same cluster. 
This is a much harder problem. 

Depending on the task at hand we can pick an arbitrary number of clusters. If the goal is to test how well 
our method can recover the true POS tags then we should use the same number of clusters as POS tags. On 
the other hand, if the task is to extract features to be used by other methods we can use a much bigger number 
of clusters (e.g., 200) to capture correlations not captured by POS tags, like lexical affinity. 

Note, however that nothing is said about the identity of each cluster. The model has no preference in 
assigning cluster 1 to nouns vs cluster 2 to nouns. Given this non-identif lability several metrics have been 
proposed for evaluation (Reichart and Rappoport 2009 Haghighi and Klem} 2006 [Meila 2007 Rosenberg andj 
Hirschberg 2007). In this class we will use a common and simple metric called 1-Many, which maps each 



cluster to majority pos tag that it contains (see Figure 2.6 for an example). 




Figure 2.6: Confusion Matrix example. Each cluster is a column. The best tag in each column is represented 
under the column (1-many) mapping. Each color represents a true Pos Tag. 



Exercise 2.11 Run 20 epochs of the EM algorithm for part of speech induction: 

»> hmm. train_EM (train_seq, 0.1, 20, evaluate=True) 

»> viterbi_pred_test = hmm. viterbi_decode_corpus (test_seq) 

>>> poster ior_pred_test = hmm.posterior_decode_corpus (test_seq) 

»> eval_viterbi_test = hnim. evaluate_corpus (test_seq, viterbi_pred_test ) 

>>> eval_posterior_test = hmm . evaluate_corpus (test_seq, posterior_pred_test) 

Initial accuracy: 0.303638 

Iter: 1 Log Likelihood: -101824 . 7 63927 

Iter: 1 Accuracy: 0.305441 

Iter: 2 Log Likelihood: -78057 . 108346 

Iter: 2 Accuracy: 0.321976 

Iter: 3 Log Likelihood: -77813.725501 

Iter: 3 Accuracy: 0.357451 

Iter: 4 Log Likelihood: -77 1 92 . 94 7 674 

Iter: 4 Accuracy: 0.385109 

Iter: 5 Log Likelihood: -7 61 91 . 80084 9 
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Iter: 5 Accuracy: 0.392123 

Iter: 6 Log Likelihood: -75242 . 572729 

Iter: 6 Accuracy: 0.391121 

Iter: 7 Log Likelihood: -74 3 92 . 8 92 4 96 

Iter: 7 Accuracy: 0.404249 

Iter: 8 Log Likelihood: -73357 . 542833 

Iter: 8 Accuracy: 0.399940 

Iter: 9 Log Likelihood: -72135 . 182778 

Iter: 9 Accuracy: 0.399238 

Iter: 10 Log Likelihood: -70924 . 246230 

Iter: 10 Accuracy: 0.395430 

Iter: 11 Log Likelihood: -69906.561800 

Iter: 11 Accuracy: 0.394328 

Iter: 12 Log Likelihood: -69140 . 228623 

Iter: 12 Accuracy: 0.390821 

Iter: 13 Log Likelihood: -68541 . 416423 

Iter: 13 Accuracy: 0.391522 

Iter: 14 Log Likelihood: -68053 . 456865 

Iter: 14 Accuracy: 0.389117 

Iter: 15 Log Likelihood: -67 667 . 318961 

Iter: 15 Accuracy: 0.386411 

Iter: 16 Log Likelihood: -67 337 . 685686 

Iter: 16 Accuracy: 0.385409 

Iter: 17 Log Likelihood: -67054 . 571821 

Iter: 17 Accuracy : 0.385409 

Iter: 18 Log Likelihood: -667 69 . 97 3881 

Iter: 18 Accuracy: 0.385409 

Iter: 19 Log Likelihood: -66442 . 608458 

Iter: 19 Accuracy: 0.385409 

»> confusion_matrix = cm.build_confusion_matrix(test_seq.seq_list, viterbi_pred_test, 

len ( corpus . tag_dict ) , hmm . get_num_states () ) 
»> cm. plot_confusion_bar_graph ( confusion_matrix, corpus . tag_dict , 

xrange (hmm. get_num_states ( ) ) , 'Confusion matrix') 



Note: your results may not he the same as in this example since we are using a random start, hut the trend should he the 
same, with log-likelihood increasing at every iteration. 

In the previous exercise we used an HMM to do Part-of-Speech induction using 12 clusters (by omission 
the HMM uses as number of hidden states the one provided by the corpus). A first observation is that the 
log-likelihood is always increasing as expected. Another observation is that the accuracy goes up from 32% to 
38%. Note that normally you will run this algorithm for 200 iterations, we stopped earlier for time constraints. 
Another observation is that the accuracy is not monotonic increasing, this is because the likelihood is not a 
perfect proxy for the accuracy. In fact all that the likelihood is measuring are co-occurrences of words in the 
corpus; it has no idea of pos tags. The fact that we are improving derives from the fact that language is not 
random but follows some specific hidden patterns. In fact this patterns are what true pos-tags try to capture. 
A final observation is that the performance is really bad compared to the supervised scenario, so there is a 



lot of space for improvement. The actual state of the art is around 71% for fully unsupervised (Gra^a 2010, 



Berg-Kirkpatrick et al.}[2010^ and 80% ( jPas and Petrov[[20Tl) using parallel data and information from labels 



in the other language. 



Looking at Figure 2.6 it shows the confusion matrix for this particular example. A first observation is that 
most clusters are mapped to nouns, verbs or pimctuation. This is a known fact since there are many more 
noxms and verbs than any other tags. Since maximum likelihood prefers probabilities to be uniform (imagine 
two parameters: in one setting both have value 0.5 so the likelihood will be 0.5*0.5 = 0.25, while in the other 
case one as 0.1 and 0.9 so the maximum likelihood is 0.09). Several approaches have been proposed to address 
this problem, moving towards a Bayesian setting 0ohnsor{ 2007^ , or using Posterior Regularization (jGra^a 
etaLl|2009) . 
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Day 3 

Learning Structured Predictors 



In this class, we will continue to focus on sequence classification, but instead of following a generative ap- 
proach (like in the previous chapter) we move towards discriminative approaches. Recall that the difference 
between these approaches is that generative approaches attempt to model the probability distribution of the 
data, P(X, Y), whereas discriminative ones only model the conditional probability of the sequence, given the 
observed data, P(Y|X). 



Today's Assignment 

The assignment of today's class is to implement the structured version of the perceptron algorithm. 



3.1 Classification vs Sequential Classification 



Table [3T| shows how the models for classification have counterparts in sequential classification. In fact, in the 
last chapter we discussed the Hidden Markov model, which can be seen as a generalization of the Naive Bayes 
model for sequences. In this chapter, we will see a generalization of the Perceptron algorithm for sequence 



problems (yielding the Structured Perceptron, Collins|2002^ and a generalization of Maximum Entropy model 
for sequences (yielding Conditional Random Fields, Lafferty et al.^200 1 ). Note that both these generalizations 
are not specific for sequences and can be applied to a wide range of models (we will see in tomorrow's lecture 
how these methods can be applied to parsing). Although we will not cover all the methods described in 
Chapter [ij bear in mind that all of those have a structured counterpart. It should be intuitive after this lecture 
how those methods could be extended to structured problems, given the perceptron example. 



Classification Sequences 


Generative 


Naive Bayes (Sec. 1.2 


Hidden Markov Models (Sec. 


2.2 


Discriminative 


Perceptron (Sec. 1.4.3| 


Structured Perceptron (Sec. 3.5 


1 


Maximum Entropy (Sec. 1.4.41 


Conditional Random Fields (Sec. 3.4 



Table 3.1: Summary of the methods used for classification and sequential classification covered in this guide. 



Throughout this chapter, we will be searching for the solution of 

argmaxP(Y = y\X = x) = argmaxw ■ f{x,y). (3.1) 

yeA~ yeA^ 

where zv is a weight vector, and f{x,y) is a feature vector. We will see that this sort of linear models are more 
flexible than HMMs, in the sense that they may incorporate more general features while being able to reuse 
the same decoders (based on the Viterbi and forward-backward algorithms). In fact, the exercises in this lab will 
not touch the decoders that have been developed in the previous lab. Only the training algorithms and the function 
that compute the scores will change. 

As in the previous section, y = i/i . . . i/n is a sequence so the maximization is over an exponential number 
of objects, making it intractable for brute force methods. Again we will make an assumption analogous to the 
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"first order Markov independence property," so that the features may decompose as a sum over nodes and 
edges in a trellis. This is done by assuming that expression 3.1 can be written as: 

N N-1 

arg max ^ w ■ /gmiss 

{i,x,yi) + w ■ f^^i{x,yi) + ^ ■ ft,^,nsi^,x,yi,yi+i) + w ■ /w(^.yN) (3.2) 
yeA^' i=l'' ' ' ' 1=1 ' ' ' ^ ' 

SCOreemiss SCOreinit SCOretrans SCOrefinai 

In other words, the scores scoreemiss/ scoreinit/ scoretrans and scorefj^ai are now computed as inner products 
between weight vectors and feature vectors rather than log-probabilities. The feature vectors depend locally 
on the output variable {i.e., they only look at a single or a pair y{,yi^{); however they may depend globally 
on the observed input x = x\,. . . , x^- 



3.2 Features 

In this section we will define two simple feature sets[^ The first feature set will only use identity features, 
and will mimic the features used by the HMM model from the previous section. This will allow us to directly 
compare the performance of a generative vs a discriminative approach. Table [3l2| depicts the features that are 
implicit in the HMM, which was the subject of the previous chapter. These features are indicators of initial, 
transition, final, and emission events. 



Condition 


Name 


J/j = cjt & f = 0 


Initial Features 


3// = C)t & = C; 


Transition Features 


yi = Ck&:i = N 


Final Features 


Xi = Wj & yi = Cfc 


Emission Features 



Table 3.2: iDFeatures feature set. This set replicates the features used by the HMM model. 



Note that the fact that we were using a generative model has forced us (in some sense) to make strong 
independence assumptions. However, since we now move to a discriminative approach, where we model 
P{Y\X) rather than P{X, Y), we are not tied anymore to some of these assumptions. In particular: 

• We may use "overlapping" features, e.g., features that fire simultaneously for many instances. For exam- 
ple, we can use a feature for a word, such as a feature which fires for the word "brilliantly", and another 
for prefixes and suffixes of that word, such as one which fires if the last two letters of the word are "ly". 
This would lead to an awkward model if we wanted to insist on a generative approach. 

• We may use features that depend arbitrarily on the entire input sequence x. On the other hand, we still 
need to resort to "local" features with respect to the outputs {e.g. looking only at consecutive state pairs), 
otherwise decoding algorithms will become more expensive. 

This leads us to the second feature set, composed of features that are traditionally used in POS tagging 



with discriminative models. See Table 3.3 for some examples. Of course, we could have much more complex 
features, looking arbitrarily to the input sequence. We are not going to have them in this exercise only for 
performance reasons (to have less features and smaller caches). State-of-the-art sequence classifiers can easily 
reach over one million features! 

Our features subdivide into two groups: /gmiss' /inif ^^'^ /final instances of node features, depending 

only on a single position in the state sequence (a node in the trellis); /tj^rLs edge features, depending on two 
consecutive positions in the state sequence (an edge in the trellis). Similarly as in the previous chapter, we 
have the following scores (also called log-potentials in the literature on CRFs and graphical models): 

• Initial scores. These are scores for the initial state. They are given by 

scoreinit(x,i/i) = w ■ /init(^/!/i)- (3-3) 

• Transition scores. These are scores for two consecutive states at a particular position. They are given by 

scoretrans (2, X, yi, ) =w (f, X, yi, yi+i ) . (3.4) 



^Although not required, all the features we will use are binary features, indicating whether a given condition is satisfied or not. 
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Condition 


Name 


y 


— C]( CSC I — u 


TT^li"1Cll T-<Oa4"11T*OC 

llllLlcli J^cclLUlt-O 


y 


= Cfc & = C/ 


TV'CIT^CI i"1 /^Tl K o a 4"1 1 T*OC 

iidiioiLiuii reaiuieb 


y 


^ 0^ V Ay 

— C\ QS.t — iV 


midi -Teatureo 


X 




Ddbic j_jiiiibbiuii Featureb 


X 


= zvj & zvj is uppercased & j/, = cj. 


Uppercase Features 


X 


= & Wj contains digit & y; — Cfc 


Digit Features 


X 


= zfy & Wj contains hyphen & y; = cj- 


H5rphen Features 


X 


= wy & Wj[0..i]\fi e [1,2,3] = Cfc 


Prefix Features 


X 


= if^- & If/ [ if; — L. ]Vz e [1,2,3] & y, = Cjt 


Suffix Features 



Table 3.3: Extended feature set. Some features in this set could not be included in the HMM model. The fea- 
tures included in the bottom row are all considered emission features for the purpose of our implementation, 
since they all depend on i, x and y,. 



• Final scores. These are scores for the final state. They are given by 

scorefinai(x,yN) = w ■ /finai(x,yN). 

• Emission scores. These are scores for a state at a particular position. They are given by 

scoreemiss {i,x,yi)^w- f^^^ {i, x, y,- ) . 



(3.5) 



(3.6) 



3.3 Discriminative Sequential Classifiers 

Given a weight vector w, the conditional probability Pit,(y = y|X = x) is then defined as follows: 

P^(y = y|X = x) = 

/ N-1 N \ 

Z{w,x) ( ^•/mit(^'yi)+ E W-/trans(«>'y<'ym)+«'-/final(^'yN) + E«'-/emiss('>'yO 1(3-7) 

where the normalizing factor Z{w, x) is called the partition function: 
Z(w,x) — 

/ N-1 N \ 

exp w ■ f^i{x,yi) + E ■ fti^{hx,yi,yi+i) + w ■ fi^i{x,yt^) + E ' /emiss('>'yO (3.8) 
i/eAW \ i=i i=i J 

3.3.1 Training 

For tiaining, the important problem is that of obtaining the weight vector w that lead to an accurate classifier. 
We will discuss two possible strategies: 

1. Maximizing conditional log-likelihood from a set of labeled data {{x"',y^)}^^^, yielding conditional 
random fields. This corresponds to the following optimization problem: 



M 

w = argmax £ logPi„(Y = y^'lX = x""). 

w m=l 



(3.9) 



To avoid overfitting, it is common to regularize with the Euclidean norm function, which is equivalent 
to considering a zero-mean Gaussian prior on the weight vector. The problem becomes: 



M ™, A 

w — argmax ^ logPi„(Y — y'"|X — x^^ 

w m=l 



'"^ - '-iiu;||2. 



(3.10) 



This is precisely the structured variant of the logistic regression method discussed in Chapter 1. Un- 
like HMMs, this problem does not have a closed form solution and has to be solved with numerical 
optimization. 
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2. Alternatively, running the structured perceptron algorithm to obtain a weight vector w that accurately 
classifies the training data. We will see that this simple strategy achieves results which are competitive 
with conditional log-likelihood maximization. 

3.3.2 Decoding 

For decoding, there are three important problems that need to be solved: 

1. Given X = x, compute the most likely output sequence y (the one which maximizes Piv{y = y\X = x)). 

2. Compute the posterior marginals PwO^i = y, |X = x) at each position i. 

3. Evaluate the partition function Z{w,x). 

Interestingly, all these problems can be solved by using the very same algorithms that were already imple- 
mented for HMMs: the Viterbi algorithm (for 1) and the forward-backward algorithm (for 2-3). All that 
changes is the way the scores are computed. 



3.4 Conditional Random Fields 



Conditional Random Fields (CRF) (Lafferty et aL}|200T) can be seen as an extension of the Maximum Entropy 



(ME) models to structured problems]^ They are globally normalized models: the probability of a given sentence 



is given by Equation 3.7 There are only two differences with respect to the standard ME model described a 



couple of days ago for multi-class classification: 

• Instead of computing the posterior marginals P(y = y\X = x) for all possible configurations of the 
output variables (which are exponentially many), it assumes the model decompose into "parts" (in this 
case, nodes and edges), and it computes only the posterior marginals for those parts, P(Y, = y,|X = x) 
and P(Y, = y„ Y,+i = yz+i |X = x). Crucially, we can compute these quantities by using the very same 
forward-backward algorithm that we have described for HMMs. 

• Instead of updating the features for all possible outputs y' G A^, we again exploit the decomposition 
into parts above and update only "local features" at the nodes and edges. 



Algorithm 10 shows the pseudo code to optimize a CRF with the stochastic gradient descent (SGD) al- 
gorithm (our toolkit also includes an implementation of a quasi-Newton method, L-BFGS, which converges 
faster, but for the purpose of this exercise, we will stick with SGD). 

Exercise 3.1 In this exercise you will train a CRF using dijferent feature sets for Part-of-Speech Tagging. Start with the 
code below, which uses the ID feature set from table\3.2\ 



Import Ixmls . sequences . crf_online crfo 

import Ixmls . sequences . structured_perceptron spc 

import Ixmls . readers .pos_corpus pec 

import Ixmls . sequences . id_feature as idfc 

import Ixmls . sequences . extended_feature as exfc 

print "CRF Exercise" 

corpus = pec . PostagCorpus ( ) 

train_seq = corpus . read_sequence_list_conll ( "data/train- 02 -21 . con 11 " , max_sent_len=l 0 , 
max_nr_sent=l 000) 

test_seq = corpus . read_sequence_list_conll ( "data/test-23 . conll " , max_sent_len=l 0 , 
max_nr_sent=l 000) 

dev_seq = corpus . read_sequence_list_conll ( "data/dev-22 . conll " , max_sent_len=l 0 , max_nr_sent 
=1000) 

feature_mapper = idfc . IDFeatures (train_seq) 
feature_mapper . build_features ( ) 



^ An earlier, less successful, attempt to perform such an extension was via Maximum Entropy Markov models (MEMM) i McCallum 
[et al. 2000 L There each factor (a node or edge) is a locally normalized maximum entropy model. A shortcoming of MEMMs is its so- 
called labeling bias |Bottou| [l991[ , vifhich makes them biased towards states with few successor states (see [Lafferty et al.,^200i] for more 
information). 
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Algorithm 10 SGD for Conditional Random Fields 



1: input: D, A, number of rounds T, learning rate sequence {t]t)t=l,...,T 

2: initialize = 0 

3: for t = 1 to T do 

4: choose m = m{t) randomly 

5; take training pair (x"',i/'") and compute the probability Ptv{y = = x™) using the current model w 



and Eq. 3.7 



6: for every i, y'-, and y-^j, compute marginal probabilities P(Y/ = y-jX = x) and P(Y/ = y-, Y/_|_i = 
yj+l 1^ = x™) using the current model w, for each node and edge, through the forward-backward algo- 
rithm. 

7: compute the feature vector expectation: 

yi 

N-1 

E E ^(y/'ymk™)/trans(^''^™ y!'ym) + 

!=1 Vi'Vi+l 

EPiyN\nfiinA^"',yN) + 

Vn 
N 

EE^(y-l^'")/emiss(^^^yO- (3.11) 

1=1 Vi 

8: update the model: 

^ (1 _ x,^,)w' + rjt {f{x"',yn - E^<[/(x"',y)]) 

9: end for 
10: output: w 



crf_online = erf o . CRFOnline (corpus . word_dict , corpus . tag_dict , f eat ur e_mapper) 

crf_online . num_epochs = 20 

crf_online . train_supervised (train_seq) 

Epoch: 0 Objective value: -5.779018 

Epoch: 1 Objective value: -3.192724 

Epoch: 2 Objective value: -2.717537 

Epoch: 3 Objective value: -2.436614 

Epoch: 4 Objective value: -2.240491 

Epoch: 5 Objective value: -2.091833 

Epoch: 6 Objective value: -1.973353 

Epoch: 7 Objective value: -1 . 875643 

Epoch: 8 Objective value: -1.793034 

Epoch: 9 Objective value: -1.721857 

Epoch: 10 Objective value: -1.659605 

Epoch: 11 Objective value: -1.604499 

Epoch: 12 Objective value: -1.555229 

Epoch: 13 Objective value: -1.510806 

Epoch: 14 Objective value: -1.470468 

Epoch: 15 Objective value: -1.433612 

Epoch: 16 Objective value: -1.399759 

Epoch: 17 Objective value: -1.368518 

Epoch: 18 Objective value: -1 . 339566 

Epoch: 19 Objective value: -1.312636 



You will receive feedback when each epoch is finished, note that running the 20 epochs might take a while. 
After training is done, evaluate the learned model on the training, development and test sets. 

pred_train = crf_online . viterbi_decode_corpus (train_seq) 
pred_dev = crf_online . viterbi_decode_corpus (dev_seq) 
pred_test = crf_online . viterbi_decode_corpus (test_seq) 
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eval_train = crf_online . evaluate_corpus (train_seq, pred_train) 
eval_dev = crf_online . evaluate_corpus (dev_seq, pred_dev) 
eval_test = crf_online . evaluate_corpus (test_seq, pred_test) 

print "CRF - ID Features Accuracy Train: %.3f Dev. %.3f Test: % . 3f"% (eval_train, eval_dev, 
eval_test ) 



You should get values similar to these: 

Out [ ] : CRF - 

ID Features Accuracy Train: 0.949 Dev: 0.846 Test: 0.858 



Compare with the results achieved with the HMM model (0.837 on the test set, from the previous lecture). 
Even using a similar feature set a CRF yields better results than the HMM from the previous lecture. Perform 
some error analysis and figure out what are the main errors the tagger is making. Compare them with the 
errors made by the HMM model. (Hint: use the methods developed in the previous lecture to help you with 
the error analysis). 

Exercise 3.2 Repeat the previous exercise using the extended feature set. Compare the results. 

feature_mapper = exfc.ExtendedFeatures(train_seq) 
feature_mapper . build_features ( ) 

crf_online = erf o . CRFOnline (corpus . word_dict , corpus . tag_dict, feature_mapper) 

crf_online . num_epochs = 20 

crf_online . train_supervised (train_seq) 



Epoch : 


0 Objective value: 


-7. 


141596 


Epoch : 


1 Objective value: 


-1 . 


807511 


Epoch : 


2 Objective value: 


-1 . 


218877 


Epoch : 


3 Objective value: 


-0. 


955739 


Epoch : 


4 Objective value: 


-0. 


807821 


Epoch : 


5 Objective value: 


-0. 


712858 


Epoch : 


6 Objective value: 


-0. 


647382 


Epoch : 


7 Objective value: 


-0. 


599442 


Epoch : 


8 Objective value: 


-0. 


562584 


Epoch : 


9 Objective value: 


-0. 


533411 


Epoch : 


10 Objective value. 


-0 


.509885 


Epoch : 


11 Objective value. 


-0 


. 490548 


Epoch : 


12 Objective value. 


-0 


.474318 


Epoch : 


13 Objective value. 


-0 


.460438 


Epoch : 


14 Objective value. 


-0 


. 448389 


Epoch : 


15 Objective value. 


-0 


. 437800 


Epoch : 


16 Objective value. 


-0 


. 428402 


Epoch : 


17 Objective value. 


-0 


. 419990 


Epoch : 


18 Objective value. 


-0 


. 412406 


Epoch : 


19 Objective value. 


-0 


. 405524 



pred_train = crf_online . viterbi_decode_corpus (train_seq) 
pred_dev = crf_online . viterbi_decode_corpus (dev_seq) 
pred_test = crf_online .viterbi_decode_corpus (test_seq) 

eval_train = crf_online . evaluate_corpus (train_seq, pred_train) 
eval_dev = crf_online . evaluate_corpus (dev_seq, pred_dev) 
eval_test = crf_online . evaluate_corpus (test_seq, pred_test) 

print "CRF - Extended Features Accuracy Train: %.3f Dev: %.3f Test: % . 3f"% (eval_train, 
eval_dev, eval_test ) 



You should get values close to the following: 
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CRF - Extended Features Accuracy Train: 0.984 Dev. 0. 



Test: 0.894 



Compare the errors obtained with the two different feature sets. Do some error analysis: what errors were correct by 
using more features? Can you think of other features to use to solve the errors found? 

The main lesson to learn from this exercise is that, usually, if you are not satisfied by the accuracy of your 
algorithm, you can perform some error analysis and find out which errors your algorithm is making. You can 
then add more features which attempt to improve those specific errors (this is known as feature engineering). 
This can lead to two problems: 

• More features will make training and decoding more expensive. For example, if you add features that 
depend on the current word and the previous word, the number of new features is the square of the 
number of different words, which is quite large. For example, the Perm Treebank has around 40000 
different words, so you are adding a lot of new features, even though not all pairs of words will ever 
occur. Features that depend on three words (previous, current, and next) are even more numerous. 

• If features are very specific, such as the (previous word, current word, next word) one just mentioned, 
they might occur very rarely in the training set, which leads to overfit problems. Some of these problems 
(not all) can be mitigated with techniques such as smoothing, which you already learned about. 



3.5 Structured Perceptron 

The structured perceptron ( |Collins[ |2002^, namely its averaged version, is a very simple algorithm that relies on 
Viterbi decoding and very simple additive updates. In practice this algorithm is very easy to implement and 
behaves remarkably well in a variety of problems. These two characteristics make the structured perceptron 
algorithm a natural first choice to try and test a new problem or a new feature set. 

Recall what you learned about the perceptron algorithm (Section 1.4.3| and compare it against the struc- 
tured perceptron (Algorithm [TT). 

Algorithm 11 Averaged Structured perceptron 



input: D, number of rounds T 
initialize ^ = 0 
for t = 1 to T do 

choose m = m{t) randomly 

take training pair {x™,y"') and predict using the current model w, through the Viterbi algorithm: 

y <— argmaxzf' ■ /(x™,i/') 

update the model: w'+i ^ + /(x'",y'«) - f{x"',y) 
end for 

output: the averaged model w <— ^ TJ=-[ 



There are only two differences, which mimic the ones already seen for the comparison between CRFs and 
multi-class ME models: 

• Instead of explicitly enumerating all possible output configurations (exponentially many of them) to 
compute y := argmax^/^y w ■ f{x"',y'), it finds the best sequence through the Viterbi algorithm. 

• Instead of updating the features for the entire y, it updates only the node and edge features at the posi- 
tions where the labels are different — i.e., where mistakes are made. 



3.6 Assignment 

Exercise 3.3 Implement the structured perceptron algorithm. 

To do this, edit file sequences/structured-perceptron . py and implement the function 
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def perceptron_update (self , sequence) : 
pass 



This function should apply one round of the perceptron algorithm, updating the weights for a given sequence, and return- 
ing the number of predicted labels (which equals the sequence length) and the number of mistaken labels. 
Hint: adapt the function 

def grad±ent_update(self, sequence, eta): 



defined in file sequences/crf .online .py. You will need to replace the computation of posterior marginals by the 



Viterbi algorithm, and to change the parameter updates according to Algorithm 11 Note the role of the functions 

self. feature_mapper . get_ *_features ( ) 

in providing the indices for the features obtained for f{x'",y'") or f{x'",y) 



Exercise 3.4 Repeat Exercises 3.1\3.2 using the structured perceptron algorithm instead of a CRF. Report the results. 



Here is the code for the simple feature set: 



feature_mapper = idfc . IDFeatures (traln_seq) 
feature_mapper . build_features ( ) 

print "Perceptron Exercise" 

sp = spc . StructuredPerceptron (corpus . word_dict, corpus . tag_dict, feature_mapper ) 

sp . num_epochs = 20 

sp . train_supervised (train_seq) 

Epoch: 0 Accuracy: 0.656806 
Epoch: 1 Accuracy: 0.820898 
Epoch: 2 Accuracy: 0.879176 
Epoch: 3 Accuracy: 0.907432 
Epoch: 4 Accuracy: 0.925239 
Epoch: 5 Accuracy: 0.939956 
Epoch: 6 Accuracy: 0.946284 
Epoch: 7 Accuracy: 0.953790 
Epoch: 8 Accuracy: 0.958499 
Epoch: 9 Accuracy: 0.955114 
Epoch: 10 Accuracy: 0.959235 
Epoch: 11 Accuracy: 0.968065 
Epoch: 12 Accuracy: 0.968212 
Epoch: 13 Accuracy: 0.966740 
Epoch: 14 Accuracy: 0.971302 
Epoch: 15 Accuracy: 0.968653 
Epoch: 16 Accuracy: 0.970419 
Epoch: 17 Accuracy: 0.971891 
Epoch: 18 Accuracy: 0.97174 4 
Epoch: 19 Accuracy: 0.973510 

pred_train = sp . viterbi_decode_corpus (train_seq) 
pred_dev = sp . viterbi_decode_corpus (dev_seq) 
pred_test = sp . viterbi_decode_corpus (test_seq) 



eval_train = sp . evaluate_corpus (train_seq, pred_train) 
eval_dev = sp . evaluate_corpus (dev_seq, pred_dev) 
eval_test = sp . evaluate_corpus (test_seq, pred_test) 

print "Structured Perceptron - ID Features Accuracy Train: %.3f Dev: %.3f Test: %.3f"%( 
eval_train, eval_dev, eval_test ) 
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structured Perceptron - ID Features Accuracy Train: 0.984 Dev. 0.835 Test: 0.840 



Here is the code for the extended feature set: 

feature_mapper = exfc .ExtendedFeatures {train_seq) 
feature_mapper . build_features () 

sp = spc . StructuredPerceptron (corpus . word_dict, corpus . tag_dict, feature_mapper ) 

sp . num_epochs = 20 

sp. train_supervised (train_seq) 



Epoch: 0 Accuracy: 0.764386 

Epoch: 1 Accuracy: 0.872701 

Epoch: 2 Accuracy: 0.903458 

Epoch: 3 Accuracy: 0.927594 

Epoch: 4 Accuracy: 0.938484 

Epoch: 5 Accuracy: 0.951141 

Epoch: 6 Accuracy: 0.949816 

Epoch: 7 Accuracy: 0.959529 

Epoch: 8 Accuracy: 0.957616 

Epoch: 9 Accuracy: 0.962325 

Epoch: 10 Accuracy: 0.961148 

Epoch: 11 Accuracy: 0.970567 

Epoch: 12 Accuracy: 0.968212 

Epoch: 13 Accuracy: 0.973216 

Epoch: 14 Accuracy: 0.974393 

Epoch: 15 Accuracy: 0.973951 

Epoch: 16 Accuracy: 0.976600 

Epoch: 17 Accuracy: 0.977483 

Epoch: 18 Accuracy: 0.974834 

Epoch: 19 Accuracy: 0.977042 



pred_train = sp . viterbi_decode_corpus (train_seq) 
pred_dev = sp . viterbi_decode_corpus (dev_seq) 
pred_test = sp . viterbi_decode_corpus (test_seq) 

eval_train = sp . evaluate_corpus (train_seq, pred_train) 
eval_dev = sp . evaluate_corpus (dev_seq, pred_dev) 
eval_test = sp . evaluate_corpus (test_seq, pred_test) 



print "Structured Perceptron - Extended Features Accuracy Train: %.3f Dev: %.3f Test: %. 
3f"% (eval_train, eval_dev, eval_test ) 



And here are the expected results: 



structured Perceptron - Extended Features Accuracy Train: 0.984 Dev: 0.888 Test: 0.890 
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Day 4 

Syntax and Parsing 



In this lab we will implement some exercises related with parsing. 



4.1 Phrase-based Parsing 
4.1.1 Context Free Grammars 

Let T be an alphabet {i.e., a finite set of symbols), and denote by T* its Kleene closure, i.e., the infinite set of 
strings produced with those symbols: 

T* = 0 U T U U . . . 

A language L is a subset of T* . The "complexity" of a language L can be loosely defined by how hard it is to 
construct a machine (an automaton) capable of distinguishing the words in L from the elements of T* which 
are not in If L is finite, a very simple automaton can be built which just memorizes the strings in L. The 
next simplest case is that of regular languages, which are recognizable hy finite state machines. These are the 
languages that can be expressed by regular expressions. An example (where T = {a,b}) is the language 
L = {ab"aa'^ \ n e N}, which corresponds to the regular expression ab * a+. Hidden Markov models (studied in 
previous lectures) can be seen as a stochastic version of finite state machines. 

A step higher in the hierarchy of languages leads to context-free languages, which are more complex than 
regular languages. These are languages that are generated by context-free grammars, and recognizable by push- 
down automata (which are slightly more complex than finite state machines). In this section we describe context- 
free grammars and how they can be made probabilistic. This will yield models that are more powerful than 
hidden Markov models, and are specially amenable for modeling the syntax of natural languages]^ 

A context-free grammar (CFG) is a tuple G = (K, T, 3?, s) where: 

1. 3Nf is a finite set of non-terminal symbols. Elements of are denoted by upper case letters {A, B,C, . . .). 
Each non-terminal symbol is a syntactic category: it represents a different type of phrase or clause in the 
sentence. 

2. T is a finite set of terminal symbols (disjoint from ?^). Elements of T are denoted by lower case letters 
{a, b,c, . . .). Each terminal symbol is a surface word: terminal symbols make up the actual content of 
sentences. The set T is called the alphabet of the language defined by the grammar G. 

3. 3? is a set of production rules, i.e., a finite relation from 3\f to CNU T)*. G is said to be in Chomsky normal 
form (CNF) if production rules in H are either of the form A ^ BC or A ^ a. 

4. S is a start symbol, used to represent the whole sentence. It must be an element of 

Any CFG can be transformed to be in CNF without loosing any expressive power in terms of the language 
it generates. Hence, we henceforth assume that G is in CNF without loss of generality. 

To see how CFGs can model the syntax of natural languages, consider the following sentence. 



iVVe recommend the classic book by Hopcroft et al. 1979 for a thorough introduction on the subject of automata theory and formal 
languages. 

^This does not mean that natural languages are context free. There is an immense body of work on grammar formalisms that relax 
the "context-free" assumption, and those formalisms have been endowed with a probabilistic framework as well. Examples are: lex- 
ical functional grammars, head-driven phrase structured grammars, combinatorial categorial grammars, tree adjoining grammars, etc. 
Some of these formalisms are mildly context sensitive, a relaxation of the "context-free" assumption which still allows polynomial parsing 
algorithms. There is also equivalence in expressive power among several of these formalisms. 
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She enjoys the Summer school. 

along with a grammar (in CNF) with the following production rules: 

S — > NP VP 
NP — > Det N 

NP — > She 
VP — > V NP 
V — > enjoys 
Det — > the 
Nbar — > N N 
N — > Summer 
N — > school 

With this grammar, we may derive the following parse tree: 

S 




^"J°y' Det Nbar 

the N^^^ 
Summer school 

4.1.2 Ambiguity 

A fundamental characteristic of natural languages is ambiguity. For example, consider the sentence 

The man sees the boy in the park with a telescope, 
for which all the following parse trees are plausible interpretations: 

1. The boy is in a park and he has a telescope: 

S 




with Det N 



NP PP 
Det N P^^^^NP 

the boy Det N a telescope 

I 

the park 

2. The boy is in a park, and the man sees him using a telescope as an instrument: 
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NP PP 
Det N p jsjp 

the boy jn Det N 

I I 
the park 

3. The man is in the park and he has a telescope, through which he sees a boy somewhere: 




The ambiguity is caused by the several places to each the prepositional phrase could be attached. This kind 
of syntactical ambiguity {PP -attachment) is one of the most frequent in natural language. 

4.1.3 Probabilistic Context-Free Grammars 

A probabilistic context-free grammar is a tuple Ge = (N, 7, % S, 0), where (3Nf, 7,%s) is a CFG and 0 is a vector 
of parameters, one per each production rule in 5i. Assuming that the grammar is in CNF, each rule of the kind 
Z XY is endowed a conditional probability 

dz^xY = Pe(XY|Z), 

and each imary rule of the kind Z — > w is endowed with a conditional probability 

For these conditional probabilities to be well defined, the entries of 6 must be non-negative and need to nor- 
malize properly for each Z E'N: 

X] ^Z^XY + ^Z->-w — 1- 

Let s be a string and t a parse tree derivation for s. For each r £ ']l,\einr{t,s)he the number of times production 
rule r appears in the derivation. According to this generative model, the joint probability of t and s factors as 
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the product of the conditional probabilities above: 

reOl 

For example, for the sentence above (She enjoys the Summer school) this probability would be 

P{t,s) = P(NP VP|S) X P(She|NP) X P(V NP|VP) X P(en joYs|v) 
xP(Det Nbar|NP) X P(the|Det) x P(n N|Nbar) 

xP(Summer|N) X P(school|N). (4.1) 

When a sentence is ambiguous, the most likely parse tree can be obtained by maximizing the conditional 
probability P{t\s); this quantity is proportional to P{t,s) and therefore the latter quantity can be maximized. 
The number of possible parse trees, however, grows exponentially with the sentence length, rendering a direct 
maximization intractable. Fortimately, a generalization of the Viterbi algorithm exists which uses dynamic 
programming to carry out this computation. This is the subject of the next section. 

4.1.4 The CKY Parsing Algorithm 



Algorithm 12 CKY algorithm 

1: input: probabilistic CFG Gg in CNF, and sentence s — Wi. . . Wjv (each is a word) 
2: 

3: {We'll fill the CKY table bottom-up with partial probabilities S and backtrack pointers ip. This is similar to 
the Viterbi algorithm but it works for parse trees rather than sequences.} 

4: 

5: {Initialization} 
6; for / = 1 to N do 

7: for each production rule r G 3? of the form Z — > Wj do 
8: S{i, i, Z) = Qz^Wi 
9: end for 
10: end for 

11: 

12: {Induction} 

13: for i — lioN Ao {i is length of span} 

14: for 7 = 1 to N — f + 1 do {;' is start of span} 

15: for each non-terminal Z e J\f do 

16: Set partial probability: 

5{i,i + i-\,Z) = max 6{i,k - l,X) x 6{k,i + i -1,Y) Oz^XY 
j<k<i+i 

17: Store backpointer: 

^pij,j + i-l,Z) = argmax(5(;,fc- 1,X) x S{k,j + i - 2,Y) x Oz^xY 

X,Y 
j<k<j+i 

18: end for 
19: end for 
20: end for 

21: 

22: {Termination} 

23: P(s,f) = S{1,N,S) 

24: Backtrack through i/' to obtain most likely parse tree f 



One of the most widely known algorithm for parsing natural language sentences is the Cocke-Kasami- 
Younger (CKY) algorithm. Given a grammar in CNF with production rules, its runtime complexity for 
parsing a sentence of length N is 0(N^|3?|). We present here a simple extension of the CKY algorithm that 
obtains the most likely parse tree of a sentence, along with its probability.Given a sequence of words Wj in a 
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sentence s = wi, . . . , Wf^, we want to learn probability for each possible non-terminal rule Z G [R, and take the 
most probable parsing tree that generates the sentence. |^ This is displayed in Alg. 
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This algorithm assigns a probability 6{i,i, Z), for every possible subtree of the sentence, starting from word 
j and spanning k words {wj, Wj^i, . . . , zVj^]^), which are generated by rule Z. The process goes from spanning 
sequences of length one to n and for each size partitioning the subtree into two parts X, and Y to compute the 
probability that each production rule Z — )• X, Y generates the subsequence. Those of length one correspond 
to the set of terminal rules Z — )• Wj, to which it is assign a conditional probability &{i,i,Z) = 6z-^w At each 
step the most probable rule Z X, Y is stored in ip{i,j,Z). Once the process is completed, the sentence is 
recognized by the grammar if the subsequence containing the entire sentence is matched by the start symbol, 
and the most probable parse tree t can be recovered by iterating backwards through the stored rules. 

Exercise 4.1 In this simple exercise, you will see the CKY algorithm in action. There is a Javascript applet that illustrates 
how CKY works (in its non-probabilistic form). Go to http : //www. diotavelli . net /peopl e/ void/ demos /^ 



cky . html\ and observe carefully the several steps taken by the algorithm. Write down a small grammar in CNF that 



yields multiple parses for the ambiguous sentence The man saw the boy in the park with a telescope, and run the 
demo for this particular sentence. What would happen in the probabilistic form of CKY? 

4.1.5 Learning the Grammar 



There is an immense body of work on grammar induction using probabilistic models (see e.g., Manning and 
Schiitze ( 1999 ) and the references therein, as well as the most recent works of Klein and Manning ( |2002 i; Smith 



and Eisner, ( ,2005, ); |Cohen et al. ( p008^ ): this is the problem of learning the parameters of a grammar from 



plain sentences only. This can be done in an EM fashion (like in sequence models), except that the forward- 
backward algorithm is replaced by inside-outside. Unfortunately, the performance of unsupervised parsers 
is far from good, at present days. Much better results have been produced by supervised systems, which, 
however, require expensive annotation in the form of treebanks: this is a corpus of sentences annotated with 
their corresponding S5m tactic trees. The following is an example of an annotated sentence in one of the most 
widely used treebanks, the Penn Treebank ( |http : / /www . els . upenn . edu / ~ treebank/ 1: 

( (S 

(NP-SBJ (NNP BELL) (NNP INDUSTRIES) (NNP Inc.) ) 
(VP (VBD increased) 

(NP (PRP$ its) (NN quarterly) ) 
(PP-DIR (TO to) 

(NP (CD 10) (NNS cents) )) 
(PP-DIR (IN from) 
(NP 

(NP (CD seven) (NNS cents) ) 
(NP-ADV (DT a) (NN share) ) ) ) ) 
(. .) ) ) 

Exercise 4.2 This exercise will show you that real-world sentences can have complicated syntactic structures. There is a 



parse tree visualizer in http: //www. ark . cs . emu . edu/treeviz/ Go to your local data/ treebanks folder 



and open the file PTB.excerpt . txt. Copy a few trees from the file, one at a time, and examine their parse trees in the 
visualizer. 

A treebank makes possible to learn a parser in a supervised fashion. The simplest way is via a generative 
approach. Instead of counting transition and observation events of an HMM (as we did for sequence models), 
we now need to count production rules and symbol occurrences to estimate the parameters of a probabilistic 
context-free grammar. While performance would be much better than that of unsupervised parsers, it would 
still be rather poor. The reason is that the model we have described so far is oversimplistic: it makes too strong 
independence assumptions. In this case, the Markovian assumptions are: 

1. Each terminal symbol w in some position i is independent of everything else given that it was derived 
by the rule Z ^ iv (i.e., given its parent Z); 

2. Each pair of non-terminal symbols X and Y spanning positions i to /, with split point k, is independent 
of everything else given that they were derived by the rule Z XY (i.e., given their parent Z). 

^Similarly, the forward-backward algorithm for c omputing posterior marginal s in sequence models can be extended for context-free 
parsing. It takes the name inside-outside algorithm. See |Manning and Schutze|jl999| for details. 
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The next section describes some model refinements that complicate the problem of parameter estimation, 
but usually allow for a dramatic improvement on the quality of the parser. 



4.1.6 Model Refinements 

A number of refinements has been made that yield more accurate parsers. We mention just a few: 

Parent annotation. This strategy splits each non-terminal symbol in the grammar {e.g. Z) by annotating it with 
all its possible parents {e.g. creates nodes Z^, Z^, . . . every time production rules like X — > Z-, X — > -Z, 
or Y ^ Z-, Y ^ Z exist in the original grammar). This increases the vertical Markovian length of the 



model, hence weakening the independence assumptions. Parent annotation was initiated by Johnson 



(jT998j and carried on in the unlexicalized parsers of Klein and Manning_ (2003 l and follow-up works. 

Lexicalization. A particular weakness of PCFGs is that they ignore word context for interior nodes in the 
parse tree. Yet, this context is relevant in determining the production rules that should be invoked in 
the derivation. A way of overcoming this limitation is by lexicalizing parse trees, i.e., annotating each 
phrase node with the lexical item (word) which governs that phrase: this is called the head word of the 
phrase. Fig. [4.1 1 shows an example of a lexicalized parse tree. To account for lexicalization, each non- 
terminal symbol in the grammar {e.g. Z) is split into many symbols, each annotated with a word that 
may govern that phrase {e.g. Z"'i, Z^"^, . . .). This greatly increases the size of the grammar, but it has a 
significant impact in performance. A string of work involving lexicalized PCFGs includes |Magerman| 
( ifgS) ; |Charniak| < [T997) ; |CollinsH[T999) . 



Discriminative models. Similarly as in sequence models (where it was shown how to move from an FIMM 
to a CRF), we may abandon our generative model and consider a discriminative one. An advantage of 
doing that is that it becomes much easier to adopt non-local input features {i.e., features that depend 
arbitrarily on the surface string), for example the kind of features obtained via lexicalization, and much 
more. The CKY parsing algorithm can still be used for decoding, provided the feature vector decompose 
according to non-terminal symbols and production rules. In this case, productions and non-terminals will 
have a score which does not correspond to a log-probability; the partition fimction and the posterior 
marginals can be computed with the inside-outside algorithm. See Taskar et al. { 2004) for an application 



of structured SVMs to parsing, and Finkel et al. < 2008| for an application of CRFs 



Latent variables. Splitting the variables in the grammar by introducing latent variables appears as an alterna- 
tive to lexicalization and parent annotation. There is a string of work concerning latent variable gram- 
mars, either for the generative and discriminative cases: [Petrov and Kleiri ( |2007 2008a b|. Some related 



work also considers coarse-to-fine parsing, which iteratively applies more and more refined models: 
IChamiak etaL] < |2006| . 



History-based parsers. Finally, there is a totally different line of work which models parsers as a sequence of 
greedy shift-reduce decisions made by a push-down automaton (Ratnaparkhi, , ^1999^ ^Henderson 2003^ . 
When discriminative models are used, arbitrary conditioning can be done on past decisions made by the 
automaton, allowing to include features that are difficult to handle by the other parsers. This comes at 
the price of greediness in the decisions taken, which implies suboptimality in maximizing the desired 
objective function. 



4.2 Dependency Parsing 

4.2.1 Motivation 

Consider again the sentence 

She enjoys the Summer school. 



along with the lexicalized parse tree displayed in Fig. 4.1 If we drop the phrase constituents and keep only 
the head words, the parse tree would become: 

enjoys 
she school 

the Summer 
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S (enjoys) 




V (enjoys) 
I 

enjoys 



NP (school) 



Det (the) 

I 

the 



Nbar (school) 



N (Summer) N (school) 
I 

Summer school 



Figure 4.1: A lexicalized parse tree for the sentence She enjoys the Summer school. 



This representation is called a dependency tree; it can be alternatively represented as shown in Fig. 4.2 De- 
pendency trees retain the lexical relationships involved in lexicalized phrase-based parse trees. However, they 
drop phrasal constituents, which render non-terminal nodes unnecessary. This has computational advantages 
(no grammar constant is involved in the complexity of the parsing algorithms) as well as design advantages 
(no grammar is necessary, and treebank annotations are way simpler, since no internal constituents need to 
be annotated). It also shifts the focus from internal syntactic structures and generative gra mmars |C homsky} 
[1965 ) to lexical and transformational grammars (Tesniere 1959 Hudson 1984 Melcuk 1988 Covingtonj|1990i . 
Arcs connecting words are called dependency links or dependency arcs. In an arc {h, m), the source word h is called 
the head and the target word m is called the modifier. 




^ She 



enjoys 



the Summer school 



Figure 4.2: A dependency tree for the sentence She enjoys the Summer school. Note the additional dummy root 
symbol (*) which is included for convenience. 



4.2.2 Projective and Non-projective Parsing 

Dependency trees constructed using the method just described (i.e., lexicalization of context-free phrase-based 
trees) always satisfy the following properties: 

1. Each word (excluding the dummy root symbol) has exactly one parent. 

2. The dummy root symbol has no parents. 

3. There are no cycles. 

4. The dummy root symbol has exactly one child. 

5. All arcs are projective. This means that for any arc (h,m), all words in its span (i.e., all words lying between 
h and m) are descendents from h (i.e. there is a directed path of dependency links from h to such word). 

Conditions 1-3 ensure that the set of dependency links form a well-formed tree, rooted in the dummy 
symbol, which spans all the words of the sentence. Condition 4 requires that there is a single link departing 
from the root. Finally, a tree satisfying condition 5 is said projective: it implies that arcs cannot cross (e.g., we 
cannot have arcs (h, m) and (h' , m') such that the word positions are ordered ash < h' < m < m'). 

In many languages (e.g., those which have free-order) we would like to relax the assumption that all trees 
must be projective. Even in languages which have fixed word order (such as English) there are syntactic 
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^ We learned a lesson in 1987 about volatility 



Figure 4.3: A non-projective parse tree. 



phenomena which are awkward to characterize using projective trees arising from the context-free assumption. 
Usually, such phenomena are characterized with additional linguistic artifacts (e.g., traces, Wh-movement, 
etc.). An example is the sentence (extracted from the Perm Treebank) 

We learned a lesson in 1987 about volatility. 

There, the prepositional phrase in 1987 should be attached to the verb phrase headed by learned (since this 
is when we learned the lesson), but the other prepositional phrase about volatility should be attached to the 
noun phrase headed by lesson (since the lesson was about volatility). To explain such phenomenon, context-free 
grammars need to use additional machinery which allows words to be scrambled (in this case, via a movement 
transformation and the consequent insertion of a trace). In the dependency-based formalism, we can get rid of 
all those artifacts altogether by allowing non-projective parse trees. These are t rees that satisfy conditions 1-3 
above, but not necessarily conditions 4 or 5rl The dependency tree in Fig. 4.3 is non-projective: note that the 
arc {lesson, about) is not projective. 

We end this section by mentioning that dependency trees can have their arcs labeled, to provide more 
detailed syntactic information. For example, the arc {enjoys, She) could be labeled as SUB J to denote that 
the modifier She has a subject function, and the arc {en joys, school) could be labeled as OBJ to denote that 
the modifier school has an object function. For simplicity, we resort to unlabeled trees, which just convey the 
backbone structure. To cope with the labels, one can use either a joint model that infers the backbone and 
labels altogether, or to have a two-stage approach that first gets the backbone structure, and then the arc labels. 



4.2.3 Algorithms for Projective Dependency Parsing 

We now turn our attention to algorithms for obtaining a dependency parse tree. We start by considering a 
simple kind of models which are called arc-factored. These models assign a score sg{h,m) to each possible 
arc {h,m) connecting a pair of words; they then score a particular dependency tree t by summing over the 
individual scores of the arcs that are present in the tree: 

scoree(0 = J2 ^eiKm). 

{h,m)et 

As usual, from the point of view of the parsing algorithm, it does not matter whether the scores come from 
a generative or discriminative approach, and which features were used to compute the scores. The three 
important inference tasks are: 

1. Obtain the tree with the largest score, 

t = argmaxscoree(f). 
t 

2. Compute the partition function (for a log-linear model), 

Z(se) = ^exp(scoree(f)), 
t 

where sg is short-hand notation for the set of all the sg{h, m) coefficients. 



''it is also common to impose conditions l-i, in which case the tree need not be projective, but it must have a single link departing from 
the root. The algorithms to be described below can be adapted for this case. 
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3. Compute the posterior marginals for all the possible arcs (which for a log-linear model is the gradient of 
the log-partition fimction), 

Pe{{h,m) e f = . ^ / ■ 

As explained in this morning's lecture, in projective dependency pars ing using arc -factored models, the 
three tasks above can be solved in time O(N^), through Eisner's algorithm ( [Eisner 1996| , which is represented 



as Algorithm 13 Eisner's algorithm is related with the CKY algorithm in the sense that it also parses a sentence 
rn a bottom-up fashion, building larger spans from smaller spans. In fact, the CKY algorithm can be adapted in 
a naive manner to perform projective dependency parsing by dropping the constituent symbols and keeping 
track of the head positions of each span. This naive strategy, however, would force us to manipulate five 
indices when combining two spans to form a larger span (the start and end points of the large span, the mid 
point, and the head positions for the two spans being combined), which would increase the complexity of the 
algorithm to O(N^). Eisner's algorithm uses the notion of complete and incomplete spans (in which all the right 
modifiers and all the left modifiers of a given head are all built separately) to reduce the complexity back to 
O ( ) . By replacing maximizations with summations, the same dynamic programming algorithm can be used 
to compute the partition function and the posterior marginals. 



Algorithm 13 Eisner 's algorithm for first-order projective dependency parsing 
1: input: Arc scores Se{h, m), tor h e {0, . . . , N}, m e {1, . . . , N}, and h ^ m, associated with a sentence 

2: {Initialization} 

3: for f = 0 to N do 

4: {Initialize incomplete spans.} 

5; incomplete := 0.0 

6: incomplete[f, f, ^] := 0.0 

7: 

8: {Initialize complete spans.} 

9: complete [f, i, ^] := 0.0 
10: complete[f, f, ^-j := 0.0 

11: end for 

12: 

13: {Induction} 

14: for A: = 1 to N do {A: is length of span} 
15: for s = 1 to N — A: do {s is start of span} 
16: Set t := s + k {t is end of span} 

17: 

18: {First, create incomplete spans.} 

19: incomplete [s, f, <—] := maXs<r<t(complete[s] [r] [^] -|- complete[7' -|- 1] [f] [^] +S0{t,s)) 

20: incomplete [s, f, := maxs<,-<f (complete [s] [r] [^] -|- complete[r -|- 1] [f] [<— ] +se{s,t)) 

21: 

22: {Then, create complete spans.} 

23: complete[s, <— ] := maxs<r<f (complete[s] [r] [<— ] -|- incomplete[r] [t] [<— ] 
24: complete[s, t,^] := maxs<r<t(incomplete[s] [r] [— )•] -|- complete[r] [t] [-^]) 
25: end for 
26: end for 

27: 

28: {Termination} 

29: Backtrack to obtain the actual tree, whose score is complete [0, N, ^-j. 



4.2.4 Algorithms for Non-Projective Dependency Parsing 

We turn our attention to non-projective dependency parsing. In that case, efficient solutions also exist for the 
three problems above; interestingly, they are based in combinatorial algorithms which are not related at all 
with d5mamic programming: 

• The first problem corresponds to finding the maximum weighted directed spanning tree in a directed graph. 
This problem is well known in combinatorics and can be solved in 0{N^) using Chu-Liu-Edmonds' 



77 



algorithm ( |Chu and Liu 1965 Edmonds 1967| p|This has first been noted by McDonald et al. (2005 1. 



The second and third problems can be solved by invoking another important result in combinatorics, the 
matrix-tree theore m (|Tutte 1984|. This fact has been noticed independently by [Smith and SmiS\ ( |2007| ; 



Koo et al. ( |2007| ; McDonald and Satta| ( |200 7|. The cost is that of computing a determinant and inverting 



a matrix, which can be done in time O(N^). The procedure is as follows. We first consider the directed 
weighted graph formed by including all the possible dependency links {h,m) (including the ones de- 
parting from the dummy root symbol, for which ^ = 0 by convention), along with weights given by 
exp{se{h, m)), and compute its (N + l)-by-(]V + 1) Laplacian matrix L whose entries are: 



f Y.h'=o'i^vise{h',m)), if h = m, 
\ — exp{sg{h,m))), otherwise. 



Denote by L the (0, 0)-mrnor of L, i.e., the matrix obtained from L by removing the first row and column. 
Consider its determinant det L and its inverse L^^. Then: 

- the partition function is given by 

Z(s9)=detL; (4.3) 

- the posterior marginals are given by 

L exp(S0(U,7njj ■ [L "J 

mm uiiieiwise. 

Exercise 4.3 In this exercise you are going to experiment with arc-factored non-projective dependency parsers. 

The CoNLL-X and CoNLL 2008 shared task datasets \Buchholz and Marsi\ \2006\ \Surdeanu et al.\ \2008^ contain 
dependency treebanksfor 14 languages. In this lab, we are going to experiment with the Portuguese and English datasets. 
We preprocessed those datasets to exclude all sentences with more than 15 words; this yielded the files: 

• data/ deppars /portuguese_train . con 11, 

• data/ deppars /portuguese_t est . con 11, 

• data/ deppars/english_train . conll, 

• data/ deppars /english_test .conll. 

1. After importing all the necessary libraries, load the Portuguese dataset: 



import sys 




sys . path . append (' . ' ) 




import Ixmls .parsing. dependency_parser a 


s depp 


dp = depp . DependencyParser ( ) 




dp . read_data ( "Portuguese " ) 





Observe the statistics which are shown. How many features are there in total? 
2. We will now have a close look on the features that can be used in the parser. Examine the file: 
code /par s i ng/ dependen cy_f eat ures . py. 
The following method takes a sentence and computes a vector of features for each possible arc {h,m) 



def create_arc_features(self, instance, h, m, add=False) : 

' ' 'Creates features for arc h — >m. ' ' ' 



We grouped the features in several subsets, so that we can conduct some ablation experiments: 



^There is a asymptotically faster algorithm by Tarjan 1977 which solves the same problem in O(N^) 
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• Basic features that look only at the parts-of -speech of the words that can he connected by an arc; 

• Lexical features that also look at these words themselves; 

• Distance/eflfwres that look at the length and direction of the dependency link (i.e., distance between the two 
words); 

• Contextual/eflfwres that look at the context (part-of-speech tags) of the words surrounding h and m. 

In the default configuration, only the basic features are enabled. The total number of features is the quantity 
observed in the previous question. With this configuration, train the parser by running 10 epochs of the structured 
perceptron algorithm: 

dp . train_perceptron (10) 
dp . test ( ) 



What is the accuracy obtained in the test set? (Note: the shown accuracy is the fraction of words whose parent was 
correctly predicted.) 

3. Repeat the previous exercise by subsequently enabling the lexical, distance and contextual features: 

dp . features . use_lexical = True 
dp . read_data ( "Portuguese " ) 
dp . train_perceptron (10) 
dp . test ( ) 

dp . features . use_distance = True 
dp . read_data ( "Portuguese " ) 
dp. train_perceptron (10) 
dp . test ( ) 

dp . features . use_contextual = True 
dp . read_data ( "Portuguese " ) 
dp . train_perceptron (10) 
dp . test ( ) 



For each configuration, write down the number of features and test set accuracies. Observe the improvements 
obtained when more features were added. 
Feel free to engineer new features! 

4. Which of the three important inference tasks discussed above (computing the most likely tree, computing the parti- 
tion function, and computing the marginals) need to be performed in the structured perceptron algorithm! What 
about a maximum entropy classifier, with stochastic gradient descent? Check your answers by looking at the fol- 
lowing two methods in code/ dependency_parser . py; 



def train _perceptron (self , n_epochs) : 

def train_crf_sgd(self, n_epochs, sigma, etaO = 0.001): 



Repeat the last exercise by training a maximum entropy classifier, with stochastic gradient descent, using A = 0.01 
and a initial stepsize ofrjo = 0.1: 

dp.train_crf_sgd(10, 0.01, 0.1) 
dp . test ( ) 
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Compare the results with those obtained by the perceptron algorithm. 
5. Train a parser for English using your favourite learning algorithm: 



dp . read_data ( " english ") 
dp . train_perceptron (10) 
dp . test ( ) 



The predicted trees are placed in the file data/deppars/english_test . conll .pred. To get a sense of 
which errors are being made, you can check the sentences that differ from the gold standard (see the data in 



data/deppars/english_test . conll) and visualize those sentences, e.g., in http : //www. ark . cs . 
\cmu . edu/treeviz/\ 

6. (Optional.) Implement Eisner's algorithm for projective dependency parsing. The pseudo-code is shown as Algo- 



rithm 13 Implement this algorithm as the function 



def parse_proj (self , scores): 
I I f 

Parse using Eisner's algorithm. 



in file dependency.decoder.py. The input is a matrix of arc scores, whose dimension is {N + l)-by-{N + 1), 
and whose {h,m) entry contains the score SQ{h,m). In particular, the first row contains the scores for the arcs that 
depart from the root, and the first column's values, along with the main diagonal, are to be ignored (since no arcs 
point to the root, and there are no self-pointing arcs). To make your job easier, we provide an implementation of the 
backtracking part: 

def backtrack_eisner(self, incomplete_backtrack, complete_backtrack, s, t, 
direction, complete, heads) : 



SO you just need to build complete /incomplete spans and their backtrack pointers and then call 
heads = -np . ones (N+1 , dtype=int) 

self .backtrack_eisner (incomplete_backtrack, complete_backtrack, 0, N, 1, 1, 

heads) 
return heads 



to obtain the final parse. 

To test the algorithm, retrain the parser on the English data (where the trees are actually all projective) by setting 
the flag dp. projective to True: 

dp = depp . DependencyParser ( ) 
dp . features . use_lexical = True 
dp . features . use_distance = True 
dp . features . use_contextual = True 
dp . read_data ( "english " ) 
dp.projective = True 
dp . train_perceptron (10) 
dp . test ( ) 



You should get the following results: 
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Number of sentences : 8044 
Number of tokens: 80504 
Number of words: 12202 
Number of pos : 4 8 
Number of features : 338014 
Epoch 1 

Training accuracy: 0 . 835637168541 
Epoch 2 

Training accuracy: 0 . 92242 6254 687 
Epoch 3 

Training accuracy: 0 . 947621628947 
Epoch 4 

Training accuracy: 0.96032 6602521 
Epoch 5 

Training accuracy: 0 . 967689840538 
Epoch 6 

Training accuracy: 0 . 97263631025 
Epoch 7 

Training accuracy: 0.97619370285 
Epoch 8 

Training accuracy: 0 . 979209016579 
Epoch 9 

Training accuracy: 0.98127569228 
Epoch 10 

Training accuracy: 0 . 981320865519 

Test accuracy (509 test instances) : 0.886732599366 



4.2.5 Model Refinements 



A number of refinements has been made that yield more accurate dependency parsers. We mention just a few: 

Sibling and grandparent features. The arc-factored assimiption fails to capture correlations between pairs of 
arcs. The dynamic programming algorithms for the projective case can be extended (at some additional 
cost) to handle features that look at consecutive sibling arcs on the same side of the head {i.e., pairs of arcs 
of the form {h, m) and {h, s) with h<m<sorh>m>s, such that no arc {h, r) exists with r between m 
and s. This has been done by Eisner and Satta ( |1999 L Similarly, grandparents can also be accommodated 
with similar extensions ( [Carreras 2007| . These are called "second-order models." 

For the non-projective case, however, any extension beyond the arc-factored model becomes NP-hard 



(McDonald and Satta 2007 1. Yet, approximate algorithms have been proposed to handle "second-order 



models" that seem to work well: a greedy method ( [McDonald et al. 20061, loopy belief propagation 
(Smith and Eisner 2008 1, a linear programming relaxation ( [Martins et aH|2009| ), and a dual decomposi- 
tion method ( [Koo et al. ' 



2010) . 



Third-order models. For the projective case, third order models have also been considered (Koo and Collins 
2010| |. This was extended to the non-projective case by Martins et al. ( 2013) . 



Transition-based parsers. Like in the phrase-based case, there is a totally different line of work which models 
parsers as a sequence of greedy shift-reduce decisions (Nivre et al. 2006 Huang and Sagae 2010). These 
parsers seem to be very fast (expected linear time) and only slightly less accurate than the state-of-the-art. 
Solutions have been worked out for the non-projective case also ( [Nivre 2009| |. 



4.2.6 External Links 

If you want to check a demo of a parser that use some of the extensions above, you can do so at 

[http : / fdemo . ark . cs . emu . edu/parse] 

This demo uses TurboParser ( [http : / /www . ark . cs . emu . edu/TurboParser) , a free software implementa- 
tion of a fast and accurate dependency parser that can parse thousands of tokens per second with state-of-the- 
art accuracies (way faster than the Python implementation that you have used at the labs). Feel free to try it at 
home! 
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Other well-known software toolkits that implement dependency parsers are: 

• MSTParser ( |http : / /www, seas . upenn . edu/ - strctlrn/MSTParser/MSTParser . html[ |, a graph- 
based dependency parser; 

• MaltParser ( http: //www .maltparser . org/) , a transition-based parser; 

• DP03 (http : //groups . csail .mit . e du/nlp/dpo3/) , a third-order projective parser; 

• Linear-time d5mamic programming parser ( |http : //acl . cs . go. edu/ -lhuang/[ |, a fast transition- 
based parser. 
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Days 

Big Data I: Introduction 



In this lab (and tomorrow), we will work with Amazon.com's Web Services (AWS|\[ a cloud based solution 
to run some simple analyses. Then, in the next lab, we will build on these tools to construct a larger learning 
system. 

We will only look at small problems, such that you can run them both locally and on AWS quickly. This way, 
you can learn how to use them within the limited time of these lab sessions. Unfortunately, this also means 
that you will not be dealing with truly large-scale problems where AWS is faster than local computations. You 
should consider these last two days as a proof-of-concept giving you the knowledge necessary to run things 
on AWS, which you can apply to your own large-scale problems after this summer school. 

Connecting to Amazon Web Services 

In these two lab days, you will use three machines: 

1. Your local machine, which may be your personal laptop or one of the computers at 1ST. From this machine, 
you will use SSH to login to: 

2. Your AWS machine, which is a regular Linux box which runs on the cloud and is private to you. You 
should use this machine to edit the code and test it. 

3. The Elastic MapReduce cluster machines. You cannot directly login to them, but will use automated tools 
to submit jobs and get back the results. 

Running at Home 

After the summer school is over, you can run the same code on your local machine or your own AWS machine. 
For this, you will need to get an Amazon account for yourself. Then, you will receive AWS credentials which 
will give you full permissions to run it in multiple modes. 

Today's Assignment 

Today we will show you how to solve a simple problem (counting words in documents) in a distributed 
manner. We will then ask you to implement a distributed solution for a problem you already know (Naive 
Bayes classification). 

5.1 Introduction 

AWS is a set of services of different types. We will focus on the Elastic Compute Cluster, EC2|^ 

In this lab, we will be logging in to Linux-based machines using a command line interface. If you are not 
familiar with the command line, don't worry - we will tell you everything you need regarding the command 
line, and your exercises only involve Python, just like the previous days. 

^http:/ /aws.amazon.com/ 
^http:/ /aws.amazon.com/ec2/ 
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5.2 Using your AWS machine 



You should have received a piece of paper with the following two pieces of information: 

1. A host name 

2. A password 

Your hostname will be something like ec2-54-234-117-69 . compute-1 . amazonaws . com and you 
should be able to log in with username ec2-user and the password you were given. This is your personal 
server running on Amazon's infrastructure! 

If you are using the command line, you can just type: 

ssh ec2-user@ec2-5 4-187-l-22 9. us -west -2 . compute . amazonaws . com 

(accept the RSA key and log in with the password we provided you). 

The machine has all of the necessary Python packages for you to work. It also has a standard collection of 
editors and other tools, including emacs and vi. If you're not used to the command line, you should probably 
use nano which is a very simple text editor. To open a file, type in the terminal 

nano your_f ilename 

From within nano, use CTRL-0 to save a file and CTRL-X to exit (you will be asked to save any modified 
files). Use CTRL-k to cut an entire line and CTRL-U to paste it back (to copy instead of cut, use cut, then paste 
in the original location, then paste in the other desired location). 

This machine is an Ubuntu based installation and you can access super user powers with sudo (if you're 
not familiar with Unix-based environments, you might not imderstand this last sentence - don't worry about 
it for now). 



5.3 MapReduce for Word Count 

The initial idea of MapReduce was an implementation by Googl^ but very soon Hadoop appeared as an 
open-source implementation of the same paradigm]^ 

The basic idea is to take your original problem, whichever it may be, and tackle it in two steps: 

1. Map step: Divide the data into several parts and send each part to a different computer. Each computer 
does some computation using only that part of the data, and returns some output]^ It comes from the 
functional programming world, and is actually present in Python as a bull tin function, map. 

2. Reduce step: Collect all the outputs from all the different computers and compute the final solution from 
those outputs. Again, this name comes from the functional programming and it is present in Python as 
the builtin reduce. However, as we will see the reduce step in MapReduce is a bit more intelligent than 
the traditional reduce. 

The classic example of application of MapReduce is the word count problem: 

• You have a collection of documents, there are many of them. 

• You want to count how many times each word appears in the whole collection. 

If the text corpus is small, this is trivial to do on a single machine. However, for big corpora, one computer 
alone would take a long time. For example, the whole English Wikipedia is about 10 GB compressed, and 
several times that when uncompressed. It would take a considerable amount of time to count the occurrence 
of each word on the whole dataset using only one computer. 

However, this problem is quite easy to parallelize using the MapReduce framework: 

1. You process each document by itself, counting how many times each word appears. This is the map step. 
Notice that each document can be processed in parallel, by multiple machines. The result for each file is 
a dictionary of a count for each word. 

2. You sum up the counts for each word to get the final coimt. This is the reduce step. Notice that you can 
now parallelize over words. 

^http://research.google.com/archive/mapreduce.html 
^http://hadoop.apache.org/ 

^The name "map" is completely unrelated to maximum a posteriori (MAP) inference which was introduced in day 1. 
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5.3.1 Keys, Values, and the MRJob package 



In fact, what we stated above is a bit of an over-simplification. In Map Reduce, the map step actually outputs 
zero or more pairs of the form (key, value). Let's suppose that you want to run the wordcount example in the 
following document which contains only two sentences: 

the center of Lisbon is the best part of the city 
the campus of 1ST is located near Alameda 

Suppose that you send each sentence to a separate worker. The output of the first worker (which is processing 
the first sentence) would be: 



Key 


Value 


"the" 


3 


"center" 


1 


"of" 


2 



and the output of the second worker (which is processing the second sentence) would be: 



Key 


Value 


"the" 


1 


"campus" 


1 


"of" 


1 



Keys can be anything you want: when counting words, it makes sense to have each worker use words as 
keys, and values as the counts of those words in the text that was passed onto that worker. 

In the reduce step, the reducer will receive these key/ value pairs and do something with them. In word 
coiinting, the natural operation for the reduce step is to sum the counts with the same key, in order to produce 
this: 



Key 


Value 


"the" 


4 


"center" 


1 


"campus" 


1 


"of" 


3 



Note that the reduce step also produces key/value pairs. As you can see, this last table is the correct output 
of the Word Count algorithm. 

In this tutorial, we will be using the package Mr Jokj^ In MRJob, you should define mapper and reducer 
functions which implement the map and reduce steps. Whenever you are ready to produce a key/value 
pair, you should use the Python instruction yield key, value, where key and value are the variables 
containing the key and valuej^ 

5.3.2 Running Word Count on one machine 

Here is an implementation of the Word Count algorithm in Python, using MRJob: 

# Import the necessary libraries: 
from mr job. job import MRJob 

class WordCount (MRJob) : 

'^Mrjob was developed by Yelp, the company behind the epinomious website; it is available as open source, under the Apache license 
- see http://pythonhosted.org/mrjob/ 

''Without going into too much detail, the yield instruction in Python is used for generators, which are list-like objects which are more 
memory efficient. Instead of storing all the values in memory like a list, a generator simply knows how to produce the next element in 
a sequence. For example, range ( 5 ) produces a list with the values 0 to 4. xrange ( 5 ) produces a generator which knows that its first 
value is 0 and that following values get incremented by 1 each time, up to and including 4, but these values are never stored together in 
memory. If you want to learn more about generators in Python, visit http://docs.python.Org/2/tutorial/classes.html - but you probably 
don't need to read that for this lesson. 
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def mapper (self , _, doc) : 
o = {} 

# Process the document 
for w in doc. split (): 

if w i c : 

c[w] += 1 

else : 

c[w] = 1 

# Now, output the results 
for w, c in c . items ( ) : 

yield w, c 

def reducer (self , key, cs): 
yield key, siiin(cs) 

wc = WordCountO 
wc . run ( ) 



This file already exists as wordcount .py (inside the wordcount folder). You will first run it on your 
Amazon server, without using Amazon EC2. To do that, s sh your machine 

ssh <username>@<pref ix> . compute . amazonaws . com 

authenticating with the credentials provided in this lab. Then, navigate to the wordcount folder and type this 
into a terminal: 

python wordcount. py en_perline0 01.txt > results.txt 

Open the results .txt file in your favorite text editor. Each line of this file contains a word and the 
corresponding count. 

Here are a few important points which you need to remember for the exercise of this session: 

1. By default, MRJob splits the input (the en_perline001.txt file) by lines. This means that each line 
may be sent to a different worker We have taken care of this detail for you, but when you use MRJob 
for your own projects, remember that the input is split by lines unless you configure it differently. In the 
future, when we say that we "split the input into documents", we will be referring to this split by lines. 

2. You must create a class and make it inherit from the MRJob class, as we did in the line 

class Wordcount (MRJob) : 



3. You should create functions with the special names mapper and reducer. These functions always have 
three inputs: self, which all class functions have, key, and value. In our case, the mapper's input is 
just text which doesn't have a key; in Python, it is common to use an underscore, _, to denote unused 
input /output arguments. 

4. The mapper function takes the input text (the doc argument), splits it into words using the command 
doc . split ( ) , and then counts the words in that text. 

5. The reducer function has a key argument which is just a string containing a word. It also has a cs 
argument, which is a generator object. For most purposes, you can treat this argument as a list of all the 
counts in all the documents of the word in the key argument. This sum is, naturally, the output of the 
Word Count algorithm for this word. 

We suggest that you spend a few minutes studying this simple code. Try putting the Python line import 
pdb ; pdb . set_trace ( ) at multiple places within the code to see what is contained in each variable. 
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5.3.3 Running Word Count on Amazon EC2 



We will now run the algorithm on your own private cluster. 

One of the advantages of MRJob is that you can use the exact same Python code as before, but changing 
a few arguments in the command line will run the Word Count on Amazon's computers instead of on the 
computer where the file is. This is what Amazon calls Elastic Map Reduce (EMR). 

To run the exact same code as before using EMR, type the following into the command line. 

python wordcount.py -r emr en_perline001.txt > results2.txt 

Since the two files don't have to have the same order for the words, it's hard to compare them visually. Try 
these two commands in the terminal, which will check how often you coimted the word will or any words 
containing will as a substring: 

grep will results.txt 
grep will results2.txt 

You should see that the word "will" appears 2151 times on the English file, with both commands. Similarly, 
"unwilling" appears 11 times. 

The file we used on the previous exercises is not very big (it contains around 0.1% of the text from the 
English Wikipedia), but running on a cluster of machines allows us to scale up enough to use larger files as 
input. For example, let's run the same on 10% of the Portuguese Wikipedia, using a file which is present on 
Amazon's S3 storage service: 

python wordcount.py -r emr s3 : / /lxmls-labs/pt_perlinel 0 . txt > results_PT.txt 
This should take about 3 minutes. 



5.4 Using Naive Bayes for Language Detection 

Language detection is surprisingly easy if you have enough data to train your system. In our case, we're 
going to use triplets of characters (or "trimers") as features. For example, if your whole training corpus is a 
sentence in English which reads "I love Lisbon" (length: 13 characters) and a sentence in Portuguese which 
reads "Adoro Lisboa" (length: 12 characters), you would say that you saw the following features: "1 1", " lo", 
"lov", "ove", "ve ", and so on in the English data, and "Ado", "dor", "oro", "ro ", and so on in the Portuguese 
data. Note the presence of whitespace on some of these trimers. 

You can now use the exact same Naive Bayes algorithm which you used for sentiment analysis on Day 1. 
Instead of classifying text into two classes (positive and negative) we're going to classify text into two other 
classes (Portuguese and English). Our training data will be parts of the Portuguese and English Wikipedias|^ 

The implementation of distributed Naive Bayes is very similar to counting words: 

1. The map step takes the whole training data of the Portuguese language and splits it into documents. The 
mapper function computes the frequency of each trimer on one document. 

2. The reduce step compiles the information from all the documents and sums the coimts of every Por- 
tuguese document. 

3. The previous two steps are repeated for the English training data. 

4. After the Map Reduce part, a post-processing step uses these counts with similar formulas as in Day 1 to 
classify new unseen text into the two classes: English or Portuguese. For convenience, we repeat these 
formulas below. 

Once you have the counts of trimer occurrences on each language, you need to estimate: 

• The prior probability of each language appearing at test time, P(PT) and P(EN). For simplicity, instead of 
using Maximum Likelihood estimation, let's assume that the users of your language detector are equally 
likely to try English and Portuguese sentences. Thus, P(PT) = P(EN) = 20 

*We will use 10% of the Portuguese Wikipedia and 1% of the English Wikipedia to ensure that you can complete this exercise in the 
time of this lab session. However, Amazon's infrastructure is easily capable of handling the whole Wikipedia. 

'You could also assume, as in Day 1, that the probability of a given language appearing at test time is proportional to the size of the 
training data of that language. Each assumption makes sense in different contexts. 
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• The probability of each feature (trimer) given the class, P{tj\c). We will again use Maximum Likelihood 
estimation, which means that the probability of trimer t given language c is equal to the number of 
times f occurred in documents of language c, divided by the total number of trimers in language c. 
Mathematically: 

M(f,-,PT) 

where n{tj,FT) is the number of times the trimer tj occurred in your Portuguese training corpus. A 
similar formula is used for P(ty|EN). 

Then, given a test sentence x, we can compute the following argmax for the two classes c = PT and c = EN: 

argmaxP(c|x) = argmax^ P(c)P(x|c) 
c 

= argmax^ P{c) Uj P{ij\c) 

= argmax^ n/^(f/|c) 
= argmax^ ^log(P(fy|c)) (5.2) 

In the first equality we used Bayes' Theorem. In the second one we used the assumption of conditional in- 
dependence of features given the class, which is at the core of Naive Bayes. In the third one, we used that 
P(PT) = P(EN) = 2/ thus the prior does not affect the argmax. In the last equation, we used the fact that 
the argmax is not affected by the application of a logarithm. Logarithms will prevent your program from 
encountering underflow errors when multiplying many numbers which are very small. 



5.5 Assignment 

Using the Word Count example we've given you above, implement the Naive Bayes language detector de- 
scribed above. You should do this in two parts: 

• Steps 1 to 3 (counting occurrences of trimers in train data, for both languages), should be run on EC2. 

• Step 4 should be run on your AWS server (it is quite fast even with just one computer). It should imple- 
ment the formula given in equation \5.2\ . 

You must run two jobs, one on a selection of the Portuguese Wikipedia (1%) and another on the English 
Wikipedia (only 0.1% as this is much larger). These files already exist on the LXMLS Toolkit as pt_perline 0 1 . txt 
and en_perlineO 0 1 . txt. They are small enough to run on your AWS machine only, and should already pro- 
duce a decent language detector. 

Steps 1-3 are very similar to the word count code which we gave you. The main difference is that you are 
splitting the document into trimers instead of words. Remember to use the syntax > en . count s . txt and > 
pt . counts . txt when you run the two jobs, to output the results to appropriate files. 

If your code is in file trimercount . py, inside the big_data folder, the two commands you need to type 
into the terminal to run things in your AWS machine are: 

python trimercount . py en_perline0 01.txt > en.counts.txt 

python trimercount . py pt_perline01.txt > pt.counts.txt 

Step 4 should use the two files en . count s . txt and pt . count s . txt to implement the language detector. 
This is all run on your AWS machine only - you don't need to use MRJob and you can use plain Python as 
you did in the previous lab sessions. If you don't have time to implement this part, you can use our own 
implementation in file postprocess . py. 

After you've managed to get things working on one machine only, you can easily run steps 1-3 on AWS: 

python trimercount . py -r emr en_perline001.txt > en.counts.txt 

python trimercount . py -r emr pt_perline01.txt > pt.counts.txt 
To check if things worked, try the command 
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grep acc en.counts.txt 

which shoidd tell you that the trimer acc appeared 3448 times in the English file. 

To test yoiir language detector, see if it classifies correctly some sentences which you make up. If you do 
not know Portuguese, here are a few sentences you can use to test: 

• Esperamos que estejam a gostar desta Escola. 

• Se tiverem qualquer comentario a fazer, por favor enviem-nos um email. 

• Os organizadores gostariam de agradecer aos monitores das aulas praticas pela grande ajuda na elaboragao 
desta aula. 
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Day 6 

Big Data II: Distributed EM 



Today's Assignment 

In the previous lesson, you learned the fundamentals of MapReduce and applied it to a simple classification 
problem (language detection, using the Naive Bayes classifier). 

Today, we're going to use MapReduce again to solve a trickier problem: using EM to perform unsupervised 
POS induction. 

Use the same login information you used yesterday to access your Amazon machine. 



6.1 Distributed EM 

Before you read this section, if you haven't already, please read section [Z7| about the non-distributed version 
of EM. If necessary, review that part of the guide, especially the pseudocode in Algorithm|9] 

EM is an iterative method: for T iterations, we have to alternate between the E-Step and the M-Step. Both 
steps can be done in a distributed manner; in this lesson, we're going to focus on a simple way to distribute 
the E-Step; the M-Step will be non-distributed (at first). To see how to distribute both steps in various config- 
urations see, e.g., Wolfe et al. ( |2008 1. 



How can we distribute the E-Step? Recall from equations 1 2.48^ -1 2.51 1 that the E-Step involves summing 



over m, where m indexes each datum in your dataset. In POS induction, m indexes every sentence. The 
important factor to understand is that each sequence can be processed completely independently of each other 
(these are the inner expressions) given the previous iterations' This will correspond to a map step. The sums 
over m will be the reduce step. 

In the last lesson, you already saw the Word Count and Naive Bayes algorithms. In these two examples, we 
counted something in the Map step and then summed them in the Reduce step. That is the intuition behind 
what we will do today: we will distribute our data to several workers in the map step, count things separately. 



then sum those counts in the reduce step. Everything else will be exactly the same as in section 2.7 
The most important concepts that you should get out of today's tutorial are: 

1. Since we can process each sequence independently, this is naturally a map step. 

2. A full Map /Reduce call will correspond to a single iteration of the algorithm. 

3. To run multiple iterations, you will need to run multiple MapReduce calls. 

4. For each call after the first, you will need to pass in the results of the previous call. 

This is a step up from yesterday's mode of working where a single Map /Reduce call would solve your 
problem. 

The structure of this lab tutorial is as follows: 

1. You will first look at the code we already provide for you. 

2. Based on it, we will run a version that is not distributed. 

3. We will convert this to a distributed version that is naive and very inefficient. 

4. We will then improve this version to be more efficient. 
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5. Finally, we will run this code on a large dataset. 



This is a rather large set of exercises - we decided to push you harder on this last day. Don't feel discouraged 
if you cannot complete everything. 

6.2 First MapReduce Version 

In your Amazon machine you will find the directory distributed_em. In it you will find a few data and 
code files: 

• sequences2 00 . txt contains the first 200 sequences. 

• word_tag_dict . pkl is an auxiliary file containing the dictionaries of words and tags. 

• emstep . py contains a skeleton of the code we will be writing. 

• A few other files we will use below (ignore them for the moment). 

We recommend that leave the file emstep . py alone and experiment with the code snippets we provide 



below in a separate script until you get to section 6.2.3 



6.2.1 Loading data 

The datafiles are in a different format from what was used in the previous EM tutorial because the simplest way 
to use Hadoop (which is the underlying framework of Amazon EC2) is make each mapper receive a single line of the file. 
To load a single line of the input file, we have provided in fUe emstep . py a function 



def load_sequence (line, word_dict, tag_dict): 
f f t 




seq = load_sequence ( s , word_dict, tag_dict) 




Load a sequence from a single line 




word_dict & tag_dict should be loaded from the file 


' word_tag_dict . pkl ^ 


Parameters 




s : str 




word_dict : diet 




tag_dict : diet 




Returns 




seq : Sequence object 
f f f 





which takes the input data and the metadata in the auxiliary file. Here is how you would use it to just coimt 
the number of tokens in the dataset: 



import pickle 

from emstep import load_sequence 

word_dict, tag_dict = pickle . load (open (' word_tag_dict . pkl ' ) 
n = 0 

for line in open (' sequences200 . txt ') : 

s = load_sequence (line, word_dict, tag_dict) 
n += len(s) 

print 'Nr of tokens: ', n 



The format of s is the same as Day 2. If you need to refresh your memory, try placing a line with import 
pdb; pdb. set _t race 0 after the call to the function to see the format of s. 
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6.2.2 Processing a single sequence 

We have also provided code for you to compute the statistics for a single sequence in the function 



def predict_sequence (sequence, hmm) : 

F I I 

log_likelihood, initial_counts , transition_counts, final_counts , \ 
emission_counts = predict_sequence (seq, hmm) 

Run forward-backward on a single sentence . 

Parameters 



seq : Sequence object 
hmm: HMM object 

Returns 



log_likelihood : float 
initial_counts : np.ndarray 
transit ion_counts : ndarray 
final_counts : ndarray 
emission_counts : ndarray 

I r r 



Here is how you could use it. First we need to initialize an HMM object: 



import Ixmls . sequences . hmm ■ .. 


hmmc 


import pickle 




word_dict, tag_dict = pickle 


load (open ( ' word_tag_dict . pkl ' ) ) 


hmm = hmmc . HMM (word_dict , tag_ 


_dict) 


hmm . initialize_random ( ) 





Here we also made use of the word_tag_dict .pkl file to get the dictionaries. Then, we initialized the 
HMM with a random initialization. 

Now, we can use the predict_sequence function: 

from emstep import load_sequence , predict_sequence 

for line in open ( ' sequences200 . txt ' ) : 

seq = load_sequence (line, word_dict, tag_dict) 
statistics = predict_sequence ( seq, hmim) 

print statistics 



It is important to note that predict_sequence is a pure functionl It does not change its inputs. This means 
that if you call it in a different order or in different machines, you will always get the same results. 

This also explains why we need to have the word dictionaries precomputed: if we built them as we go, 
then the order in which the sequences are processed would make a difference. Thus, we could not process 
the sequences in parallel. We will comment on this later when we see an alternative (more sophisticated 
implementation). In our case, we were able to compute the dictionaries using a simple Python script, but if the 
data was truly large, we could have written another MapReduce job to discover all the words in the dataset. 

If you look back to equations 1 2.48^ -1 2.51) you can see that predict_sequence is computing the value 
inside the outer sums. 



6.2.3 Combining Partial Results 

At this point, you should understand the code we provided. 

The goal of the next exercise is to write a function called combine_part ial s that can take aU the sequence 
statistics and output the final statistics. 



92 



import Ixmls . sequences . hmm as hmmc 
import pickle 

from emstep import load_sequence, predict_sequence, combine_partials 

word_dict, tag_dict = pickle . load (open (' word_tag_dict . pkl ') ) 
hmm = hmmc . HMM (word_dict , tag_dict) 
hmm . initialize_random ( ) 
statistics = [] 

for line xii open ( ' sequences200 . txt ' ) : 

seq = load_sequence (line, word_dict, tag_dict) 
statistics . append (predict_sequence (seq, hmim) ) 

final = combine_partials (statistics, hmm) 



Exercise 6.1 Write the function combine_partials. This function should take a list of all the statistics and an 
HMM object. It should modify the HMM object to reflect the results of all the sequences. 
Your function should perform some computations and then assign to the hmm object: 

def combine _partials (statistics, hmm): 
hmm. log_likelihood = ... 
hmm. initial_counts = ... 



A template is provided in emstep . py. 

Note that this separation into sequence statistics and combination does not really correspond to the expec- 
tation and maximisation steps. 

6.2.4 Using MapReduce 

The previous exercise resulted in code that was not distributed, but already had the map/ reduce structure we 
needed to make it work. For now, the reduce step is not distributed and will run on a single machine (this will 
be improved on in the next exercises). 

We will perform a complete mapreduce run for each iteration of EM that we compute. In order to run 
multiple iterations, we will run mapreduce repeatedly. 

In what follows, we will save our output to a file called hmm .pkl on each iteration and then load it from 
there on the next iteration. Naturally, the first iteration needs to be a special case and we initialize randomly 
that time. 



We will also output a Python object from our reduce method. For this, we need a bit of magic (which is 
filled in the template we provide and now explain): 



class EMStep (MRJob) : 




INTERNAL_PROTOCOL 


= PickleProtocol 


OUTPUT_PROTOCOL 


= PickleValueProtocol 


def reducer ( self , _ 


, partials) : 


yield ' result ' , 


a_python_ob ject 



By declaring our output protocol to be PickleValueProtocol, this means that we can emit a Python 
object in the reduce as the final output and it will be properly serialized to the output. 

Exercise 6.2 Based on your function combine_partials and the code we provided you, fill in the map and reduce 
steps. 

For the moment, your reduce function should have a single emission of the form yield ' result ' , hmm. Later, 
we will see more sophisticated methods. 

You may want to read the next few paragraphs before you start. 
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In order to test this code on the cluster, you need to run it with the following flags (we are also directing 
the output to the file next_iteration . pkl): 

python emstep.py \ 
-r local \ 

— f ile=word_tag_dict . pkl\ 
sequences200.txt > next_iteration .pkl 

The first flag (-r local) makes the code run on the present machine (the AWS machine), the second 
( — f ile=word_tag_dict .pkl) declares that the file word_tag_dict .pkl is needed to run the job. 

In order to run multiple iterations, you need to save the results of the previous iteration to a file, which can 
be loaded at startup. 

So, once you have run the code the first time, rename the output file by t5^ing this into the terminal: 

mv next_iteration . pkl hmm.pkl 



Notice that in the init function we check whether the file hrtim . pkl exists and load it if so: 



from OS 


import path 




if path 


exists ( ' hmm . pkl ' ) : 




hmm 


= pickle . load (open ( ' hmm . pkl ' ) . read ( 


. decode ( ' string-escape ' ) ) 


else : 






hmm 


= hmmc.HMM(word_dict, tag_dict) 




hmm 


initialize_random ( ) 





We needed to be careful with the string escapes in the above code, but otherwise were able to just rely on 
Python pickle to load the results. 

And you can now call it again {passing the hmm.pkl file on the command linel): 

python emstep.py \ 
-r local \ 

— f ile=word_tag_dict . pkl \ 
— f ile=hmm. pkl \ 

sequences200.txt > next_iteration .pkl 
Pitfall: Be careful to not over-write your HMM file! You need to perform two steps: 

1. Run the code, outputing to a temporary file. 

2. Rename the temporary file. 

The following is a mistake! 

python emstep.py \ 
-r local \ 

— f ile=word_tag_dict . pkl \ 
— file=hmm.pkl \ 
sequences200.txt > hmm.pkl 

This is a mistake because the hmm.pkl is now removed (to make space for the new output) before your 
program has a chance to read it! 

6.2.5 Running it on the cluster 

When you are happy with your code, you can now run it on the cluster: 

python emstep.py \ 
-r emr \ 

— f ile=word_tag_dict . pkl \ 
sequence s2 00 . txt 

The -r emr flags runs the code on Elastic Map Reduce now. Run it multiple times using — f ile=hmm . pkl 
to get the next iteration. 

This will take a few minutes per iteration, so do not run many iterations. We will be working on making the code 
faster before you can do so. 
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A note on dependencies 

This code depends on packages such as numpy or the Ixmls package (which contains the implementation of 
the HMM we are using and which you helped write on the lab tutorial of day 2). 

On the cluster we provide you, they are already installed, and you can just use them without any special 
effort. If you are attempting to use your own computer or reproduce the steps at home; please note that 
these packages need to be installed before hand. Check the documentation of mrjob on how to install python 
packages on the Elastic Map Reduce cluster if you want to run it on your own Amazon account. 

6.2.6 Introducing mapper.final 

While the code you wrote in the last exercise works correctly, it is horribly inefficient. The problem is that we 
output several matrices for each sequence. This results in too much communication between processes and the 
reduce step needs to load all these large matrices. 

In order to address this problem we need to introduce a new operation, the mapper_f inal. This is called, 
for each mapper, when all the input has been processed. 

You can perhaps imderstand this by imagining that mrjob is running the following loop (in Python pseudo- 
code): 

for key, value in input: 

job . mapper (key, value ) 
job .mapper_f inal () 



You can delay emission of partial results until the mapper_f inal step and then emit the partial output of 
every sequence this job processed. 

For example, here is how WordCount can be performed using mapper_f inal: 



class WordCount (MR Job) : 


def 


(init) : 




self . counts = { } 


def 


mapper (self , _, values): 




for word in values . split ( ) : 




if word in self. counts: 




self . counts [word] += 1 




else : 




self . counts [word] = 1 


def 


mapper_f inal (self) : 




for k in self. counts: 




yield k, self . counts [k] 


def 


reducer ( self , key, values) : 




yield key, suin(values) 



Note that the reduce method is still necessary: although each map job will emit the final result for all 
the documents it saw, we still need to combine the results from different map jobs (which rim on different 
machines and processed different documents). 

By only emitting results in the mapper_f inal method, we have converted the implementation to use a 
batch system: the dataset is partioned into a block for each processor, each processor works on that whole 
block, and the reduction is only made to combine the results of different processors. 

This cuts down the communication overhead drastically and also makes the reduce function be faster and 
less resource intensive. Now, it only needs to work with a single set of statistics per compute node instead 
of one for each input sequence. Resource usage for reduction now only depends on the number of machines 
used and not the size of the input. 

Exercise 6.3 Use mapper_f inal to improve your MapReduce implementation. Initialize the matrices and log likeli- 
hood to zero in the init constructor and update partial sums in the map function. Emit only in the mapper_f inal 

method. 
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The reduce function should not need to be changed at all. 

Use the previous naive version as a benchmark. The results should not change beyond a rounding error (they may 
change slightly). 

The implementation that you have now can now be run on a larger dataset. As we did for yesterday's lab, 
we have put the full datasets on S3 (you can read back on S3 on yesterday's guide). You can now test you code 
on the cluster using the full dataset: 

python emstep.py \ 
-r emr \ 

— f ile=word_tag_dict . pkl \ 

s3 : // Ixmls-labs/ all-sequences-f or-em . txt 

If you try this with your previous version, which did not use the mapper Jinal optimization; you will 
simply wait a long time before running out of resources. With this new version, it runs to completion. It might 
still take too long for you to be able to run many iterations before the lab is over, but now you can see that you 
could run this on a million sequences in a few hours if you had enough machines. See also the note on Hadoop 
latency and Big Data at the end of this chapter. 

6.2.7 Parallelizing Reduce* 

If you ran out of time to complete this next section, don't worry, this section is an advanced module and you have already 
seen the major take-home points of the tutorial with the previous exercise. 

Our code can still be improved in two ways: (1) there is a single reduce call, which does not take advantage 
of the fact that we have a cluster of machines; and (2) since the emission matrices are sparse, we are emitting 
large matrices with many zeros. 

The previous code also had reduce use resources that grow with the number of processes. This may still be 
too much if you use a thousand of machines. We will also avoid this problem. 

In order to parallelize reduce, we need to start emitting partial results at a more fine grained level. Here are 
the types of emissions we want to consider: 

1. The initial counts: this is per state. 

2. The final counts: this is per state again. 

3. The transition counts: this is for each pair of states (transition from A to B). 

4. The emission counts: this is for each pair of state and word. 

5. The log likelihood. 

In order to distinguish all of these, we will emit them as numbers with different keys. For example, the 
log likelihood will simply have the key "log likelihood", while the matrices will have keys which identify the 
matrix and the cell. 

The code below exemplifies how we could emit the log likelihood and the vector of final counts: 

yield 'log likelihood', log_likelihood 

for i in xrange (len ( counts ) ) : 

name = hmm. get_state_name (i) 
yield 'final '+name, counts [i] 



Note how we included the name of the state in the key. For the transition counts and emission counts, we 
will need a nested for loop: 

for i in xrange (transition . shape [ 0 ]) : 

for j in xrange (transition . shape [ 1 ]) : 

if transition[i, j] : # < IGNORE ZEROS 

name_i = hmm. get_state_name (i) 
name_j = hmm. get_state_name ( j ) 

yield 'transition %s %s ' % (name_i, name_j ) , transition [i, j ] 
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Again, we are including the names of the states in the key. The check for zeros is important as it is useless 
to output zero counts. If the universe of words is large, then those matrices will be sparse and we avoid useless 
computation and communication. 

Exercise 6.4 Look in the file emission_snippets . py. This contains the for loops above and more. 

Use these to improve your mapper_final function. Write the corresponding reduce function. Do not change the 
keys used for the emission as they are needed below (see the next paragraph after this exercise). 



Important: Now you should remove the OUTPUT_PROTOCOL declaration! See the note on section 6.2.4 on what 
this means. 



The resulting output is no longer a single HMM object which we can load from Python, but a text file 
which encodes all the matrices. If you have used our snippets exactly, you can also use the code in the function 
parse_hmm_f rom_output to parse this file and generate a new HMM object. 

Note that with this output method, we do not need to have the word dictionaries precomputed. Whereas 
previously, we relied on matrix indeces to keep our words apart, now we output the actual word and therefore, 
we could just process each sequence as it comes. 



6.2.8 A Note on Hadoop Overhead and Big Data 

As you probably noticed, Hadoop has a lot of overhead and each iteration takes a long time to start computing 
and finish running. In our case, this is a very significant part of the time it takes to run an iteration. 

Hadoop is heavy machinery: it takes a while to move, but then can be very powerful. The advantages 
are in the scaling: if we had a trillion sequences which would take thousands of computing hours to process, 
then the minute or so that it takes to start up would not matter and we would reap the benefits of working 
with hundreds, even thousands, of machines. We can say that Hadoop has high latency but can have high 
throughput as well. Thus, it is not very appropriate for small problems, but can scale to huge ones. 

Unfortunately, we only had a few hours in which you could work: including understanding the task, 
writing the code, debugging it and running it. Therefore, it was unfeasible to ask you to work on a problem 
with a million sequences which could only be tackled with the heavy machinery of Hadoop. We could not 
really work on very large problems. This was only a demo of what Big Data really is. 

However, the code you wrote at the end of the chapter is now perfectly scalable to any size project you 
want to tackle. The skills you learned can be applied to any web-scale problem. 
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